pkgcore Documentation
Release trunk

Brian Harring, Marien Zwart, Tim Harder

October 25, 2014

Contents

1 API Documentation
    1.1 Modules
2 Man Pages
    2.1 Installed Commands
3 Developer Notes
    3.1 Content
4 Indices and tables

CHAPTER 1
API Documentation

1.1 Modules

pkgcore
pkgcore.binpkg
pkgcore.binpkg.remote
pkgcore.binpkg.repo_ops
pkgcore.binpkg.repository
pkgcore.binpkg.xpak
pkgcore.cache
pkgcore.cache.errors
pkgcore.cache.flat_hash
pkgcore.cache.fs_template
pkgcore.cache.metadata
pkgcore.config
pkgcore.config.basics
pkgcore.config.central
pkgcore.config.cparser
pkgcore.config.dhcpformat
pkgcore.config.domain
pkgcore.config.errors
pkgcore.config.mke2fsformat
pkgcore.const
pkgcore.ebuild
pkgcore.ebuild.atom
pkgcore.ebuild.restricts
pkgcore.ebuild.conditionals
pkgcore.ebuild.const
pkgcore.ebuild.cpv
pkgcore.ebuild.digest
pkgcore.ebuild.domain
pkgcore.ebuild.ebd
pkgcore.ebuild.ebuild_built
pkgcore.ebuild.ebuild_src
pkgcore.ebuild.eclass_cache
pkgcore.ebuild.errors
pkgcore.ebuild.filter_env
pkgcore.ebuild.formatter
pkgcore.ebuild.misc
pkgcore.ebuild.portage_conf
pkgcore.ebuild.processor
pkgcore.ebuild.profiles
pkgcore.ebuild.repo_objs
pkgcore.ebuild.repository
pkgcore.ebuild.resolver
pkgcore.ebuild.triggers
pkgcore.fetch
pkgcore.fetch.base
pkgcore.fetch.custom
pkgcore.fetch.errors
pkgcore.fs
pkgcore.fs.contents
pkgcore.fs.fs
pkgcore.fs.livefs
pkgcore.fs.ops
pkgcore.fs.tar
pkgcore.gpg
pkgcore.log
pkgcore.merge
pkgcore.merge.const
pkgcore.merge.engine
pkgcore.merge.errors
pkgcore.merge.triggers
pkgcore.operations
pkgcore.operations.domain
pkgcore.operations.format
pkgcore.operations.observer
pkgcore.operations.repo
pkgcore.os_data
pkgcore.package
pkgcore.package.base
pkgcore.package.conditionals
pkgcore.package.errors
pkgcore.package.metadata
pkgcore.package.mutated
pkgcore.package.virtual
pkgcore.pkgsets
pkgcore.pkgsets.filelist
pkgcore.pkgsets.glsa
pkgcore.pkgsets.installed
pkgcore.pkgsets.system
pkgcore.plugin
pkgcore.repository
pkgcore.repository.configured
pkgcore.repository.errors
pkgcore.repository.misc
pkgcore.repository.multiplex
pkgcore.repository.prototype
pkgcore.repository.syncable
pkgcore.repository.util
pkgcore.repository.virtual
pkgcore.repository.visibility
pkgcore.repository.wrapper
pkgcore.resolver
pkgcore.resolver.choice_point
pkgcore.resolver.pigeonholes
pkgcore.resolver.plan
pkgcore.resolver.state
pkgcore.resolver.util
pkgcore.restrictions
pkgcore.restrictions.boolean
pkgcore.restrictions.delegated
pkgcore.restrictions.packages
pkgcore.restrictions.restriction
pkgcore.restrictions.util
pkgcore.restrictions.values
pkgcore.scripts
pkgcore.scripts.filter_env
pkgcore.scripts.pclone_cache
pkgcore.scripts.pconfig
pkgcore.scripts.pebuild
pkgcore.scripts.pinspect
pkgcore.scripts.pmaint
pkgcore.scripts.pmerge
pkgcore.scripts.pplugincache
pkgcore.scripts.pquery
pkgcore.spawn
pkgcore.sync
pkgcore.sync.base
pkgcore.sync.bzr
pkgcore.sync.cvs
pkgcore.sync.darcs
pkgcore.sync.git
pkgcore.sync.hg
pkgcore.sync.rsync
pkgcore.sync.svn
pkgcore.system
pkgcore.system.libtool
pkgcore.util
pkgcore.util.commandline
pkgcore.util.file_type
pkgcore.util.packages
pkgcore.util.parserestrict
pkgcore.util.repo_utils
pkgcore.vdb
pkgcore.vdb.contents
pkgcore.vdb.ondisk
pkgcore.vdb.repo_ops
pkgcore.vdb.virtuals
pkgcore.version
CHAPTER 2
Man Pages

Pkgcore installs a set of scripts for installing and removing packages and for various system maintenance operations. The man pages for each command follow.

2.1 Installed Commands

CHAPTER 3
Developer Notes

These are the original docs written for pkgcore, detailing some of its architecture, intentions, and the reasons behind certain designs.

Currently the docs aren’t accurate; this will be corrected moving forward. Right now they’re primarily useful as background information.

3.1 Content

3.1.1 Rough TODO

• rip out use.* code from pkgcore_checks.addons.UseAddon.__init__ and generalize it into pkgcore.ebuild.repository
• not hugely important, but... make a cpython version of SlottedDict from pkgcore.util.obj; 3% reduction for full repo walk, thus not a real huge concern atm.
• userpriv for pebuild misbehaves.
• http://aspn.activestate.com/ASPN/Cookbook/Python/Recipe/491285 check into, probably better than my crufty itersort; need to see how well heapq’s nlargest pop behaves (looks funky)
• look into converting MULTILIB_STRICT* crap over to a trigger
• install-sources trigger
• recreate verify-rdepends also
• observer objects for reporting back events from merging/unmerging. cpython ‘tee’ is needed, contact harring for details. A basic form of it is in now, but something more powerful is needed for parallelization. elog is bound to this also.
• Possibly convert to cpython:
    – flat_hash.database._parse_data
    – metadata.database._parse_data
    – posixpath (os.path)
• get the tree clean of direct /var/db/pkg access
• vdb2 format (ask harring for details).
• pkgcore.fs.ops.merge_contents; doesn’t rewrite the contents set when a file it’s merging is relying on symlinked directories for the full path; eg, /usr/share/X11/xkb/compiled -> /var/blah, it records the former instead of recording the true absolute path.
• pmerge mods; [ --skip-set SET ], [ --skip atom ], use similar restriction to --replace to prefer vdb for matching atoms
• refactor pkgcore.ebuild.cpv.ver_cmp usage to avoid full cpv parsing when _cpv is in use; ‘nuff said, look in pkgcore.ebuild.cpv.cpy_ver_cmp
• testing of fakeroot integration: it was working back in the ebd branch days; things have changed since then (heavily). Enabling/disabling should work fine, but will need to take a look at the contentset generation to ensure perms/gid leaks through correctly.
• modify repository.prototype.tree.match to take an optional comparison; reasoning being that if we’re just going to do a max, pass in the max so it has the option of doing the initial sorting without passing through visibility filters (which will trigger metadata lookups)
• ‘app bundles’. Reliant on serious overhauling of deps to do ‘locked deps’, but think of it as rpath based app stacks, a full apache stack compiled to run from /opt/blah for example.
• pkgcore.ebuild.gpgtree: derivative of pkgcore.ebuild.ebuild_repository, this overloads ebuild_factory and eclass_cache so that gpg checks are done. This requires some hackery, partially dependent on config.central changes (see above). Need a way to specify the trust ring to use, ‘severity’ level (different class targets works for me). Anyone who implements this deserves massive cookies.
• pkgcore.ebuild.gpgprofile: Same as above.
• reintroduce locking of certain high level components using read/write; mainly, use it as a way to block sync’ing a repo that’s being used to build, lock the vdb for updates, etc.
• preserve xattrs when merging files to properly support hardened profiles
• support standard emerge.log output so tools such as qlop work properly
• add FEATURES=parallel-fetch support for downloading distfiles in the background while building pkgs, possibly extend to support parallel downloads
• apply repo masks to related binpkgs (or handle masks somehow)
• remove deprecated PROVIDE and old style virtuals handling
• add argparse support for checking the inputted phase name with pebuild to make sure it exists; currently nonexistent input causes unhandled exceptions
• allow pebuild to be passed ebuild file paths in addition to its current atom handling; this should work similarly to how portage’s ebuild command operates
• support repos.conf (SYNC is now deprecated)
• make profile defaults (LDFLAGS) override global settings from /usr/share/portage/config/make.globals or similar, then apply user settings on top; currently LDFLAGS is explicitly set to an empty string in make.globals but the profile settings aren’t overriding that
• support /etc/portage/mirrors
• support ACCEPT_PROPERTIES and /etc/portage/package.properties
• support ACCEPT_RESTRICT and /etc/portage/package.accept_restrict
• support pmerge --info (emerge --info workalike), requires support for info_vars and info_pkgs files from profiles

3.1.2 Changes

(Note that this is not a complete list)

• Proper env saving/reloading. The ebuild is sourced once, and run from the env.
• DISTDIR has indirection now. It points at a directory of symlinks to the files. The reason for this is to prevent builds from lying about their sources, leading to fewer bugs.
• PORTAGE_TMPDIR is no longer in the ebuild env.
• (PORTAGE_|)BUILDDIR is no longer in the ebuild env.
• BUILDPREFIX is no longer in the ebuild env.
• AA is no longer in the ebuild env.
• inherit is an error in phases except for setup, prerm, and postrm.
pre/post rm are allowed only in order to deal with broken envs. Running config with a broken env isn’t allowed, because config won’t work; installing with a broken env is not allowed because preinst/postinst won’t be executed.
• binpkg building now gets the unmodified contents; thus when merging a binpkg, all files are there unmodified.

3.1.3 Commandline framework

Overview

pkgcore’s own commandline tools, and ideally most external tools too, use a couple of utilities from pkgcore.util.commandline to enforce a consistent interface and reduce boilerplate. There are also some helpers for writing tests for scripts using the utilities. Finally, pkgcore’s own scripts are started through a single wrapper (just to reduce boilerplate).

Writing a script

Whether your script is intended for inclusion with pkgcore itself or not, the first things you should write are a commandline.OptionParser subclass (unless your script takes no commandline arguments) and a main function.

The OptionParser is a lightly customized optparse.OptionParser, so the standard optparse documentation applies. Differences include:

• A couple of standard options and defaults are added. Some of this uses __init__, so if you override that (which you will) remember to call the base class (with any keyword arguments you received).
• The “Values” object used is a subclass, with a “config” property that autoloads the user’s configuration. You should access this as late as possible for a more responsive ui.
• check_values applies some minor cleanups, see the module for details. Remember to call the base method (you will usually want to do some things here).

The “main” function takes an optparse “values” object generated by your option parser and two pkgcore.util.formatters.Formatter instances, one for stdout and one for stderr. This function should do the actual work of your script. The return value of the main function is your script’s exit status. Returning None is the same thing as returning 0 (success).
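As a rough illustration of the parser/main split described above, here is a minimal sketch built on plain optparse rather than on pkgcore's commandline.OptionParser, so it runs without pkgcore installed. The --name and --shout options are made up for the example, and io.StringIO stands in for the two Formatter instances (an assumption: the real formatter API is richer than a bare write()).

```python
import io
import optparse


class OptionParser(optparse.OptionParser):
    """Stand-in for a commandline.OptionParser subclass (plain optparse here)."""

    def __init__(self, **kwargs):
        # Call the base class with whatever keyword arguments we received.
        optparse.OptionParser.__init__(self, **kwargs)
        self.add_option("--name", help="who to greet (hypothetical option)")
        self.add_option("--shout", action="store_true", default=False,
                        help="upper-case the greeting (hypothetical option)")

    def check_values(self, values, args):
        # Call the base method first, then do script-specific sanity checks.
        values, args = optparse.OptionParser.check_values(self, values, args)
        if not values.name:
            self.error("--name is required")
        return values, args


def main(options, out, err):
    """Greet the name given on the commandline."""
    greeting = "hello %s" % (options.name,)
    if options.shout:
        greeting = greeting.upper()
    out.write(greeting + "\n")
    # Falling off the end returns None, which counts as exit status 0.


if __name__ == "__main__":
    values, _ = OptionParser().parse_args(["--name", "spork"])
    out, err = io.StringIO(), io.StringIO()
    main(values, out, err)
    print(out.getvalue(), end="")
```

Keeping validation in check_values means main can assume its options are sane and only has to do the work.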
If you have used optparse before you might wonder why main only receives an optparse values object, not the remaining arguments. This is handled a bit differently in pkgcore: if you handle arguments you should sanity-check them in check_values and store them on the values object. check_values should always return an empty tuple as the second element of its return value, either because no arguments were passed or because they were all accepted by check_values. We believe this makes more sense, since it stores everything learned from the commandline on a single object.

All output has to go through the formatter. If you use “print” directly the formatter will lose track of where it is in the line, which will cause weird output if you use the “wrap” option of the formatter. The test helpers also rely on all output going through the formatters.

To actually run your script you call pkgcore.util.commandline.main (do not confuse this with your own script’s main function, the two are quite different). The simplest (and most common) call is commandline.main({None: (yourscript.OptionParser, yourscript.main)}). The weird dict is used for subcommands. The recommended place to put this call is in a tiny script that just imports your actual script module and calls commandline.main. Making your script an actual module you can import means it can be tested (and it can be useful in interactive python or for quick hacky scripts).

commandline.main takes care of a couple of things, including setting up a reporter for the standard library’s logging package and swallowing exceptions from the configuration system. It does not swallow any other exceptions your script might raise (although this might become an option in the future).

check_values and main: what goes where

The idea (as you can guess from the names) is that check_values makes sure everything passed on the commandline makes sense, but no more than that.
• The best way to report incorrect commandline parameters is by calling error("error message goes here") on the option parser. You cannot do this from main, since it has no access to the option parser. Please do not try to print something similar through the err formatter here; shift the code to check_values.
• check_values does not have access to the out or err formatter. The only way it should “communicate” is through the error (or possibly exit) methods. If you want to produce different kinds of output, do it in main. (It is possible the option parser will grow a warning method at some point; if this would be useful let us know by filing a trac ticket.)
• Use common sense. If it is part of your script’s main task it should be in main. If it changes the filesystem it should definitely be in main.

Subcommands

The main function recently gained some support for subcommands (which you probably know from most version control systems). If you find yourself trying to reimplement this kind of interface with optparse, or one similar to emerge with a couple of mutually exclusive switches selecting a mode (--depclean, --sync etc.), then you should try using this subcommand system instead.

To use it, simply define a separate OptionParser and main function for every subcommand and use the subcommand name as the key in the dict passed to commandline.main. The key None used for “no subcommand” can still be used too, but this is probably not a good idea.

If there is no parser/main pair with the key None and an unrecognized subcommand is passed (including --help), an overview of subcommands is printed. This uses the docstring of the main function, so put something useful there. If there is a None parser you should include the valid subcommands in its --help output, since there is no way to get at commandline.main’s autogenerated subcommand help if a None parser is present.
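The dict-of-pairs convention can be sketched with a toy dispatcher. This is not pkgcore's commandline.main; dispatch, the two example subcommands, the overview layout and the exit status 1 for the overview are all assumptions made for illustration, and io.StringIO again stands in for the formatters.

```python
import io
import optparse


class ListParser(optparse.OptionParser):
    """Parser for the hypothetical "list" subcommand."""


def list_main(options, out, err):
    """List things."""
    out.write("listing\n")


class SyncParser(optparse.OptionParser):
    """Parser for the hypothetical "sync" subcommand."""


def sync_main(options, out, err):
    """Sync things."""
    out.write("syncing\n")


# Subcommand name -> (parser class, main function); None would mean
# "no subcommand given".
COMMANDS = {
    "list": (ListParser, list_main),
    "sync": (SyncParser, sync_main),
}


def dispatch(subcommands, args, out, err):
    """Toy dispatcher mimicking the behaviour described above."""
    pair = subcommands.get(args[0]) if args else None
    if pair is not None:
        args = args[1:]  # strip the recognized subcommand name
    else:
        pair = subcommands.get(None)
    if pair is None:
        # No None entry: print an overview built from the main docstrings.
        err.write("Commands:\n")
        for name in sorted(subcommands):
            err.write("  %s  %s\n" % (name, subcommands[name][1].__doc__))
        return 1
    parser_class, main_func = pair
    options, _ = parser_class().parse_args(args)
    # A main function returning None means success.
    return main_func(options, out, err) or 0


if __name__ == "__main__":
    out, err = io.StringIO(), io.StringIO()
    dispatch(COMMANDS, ["sync"], out, err)
    print(out.getvalue(), end="")
```

Because the overview is only reachable when no None entry exists, adding a None parser silently hides it, which is the reason the text advises against using that key together with subcommands.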
pwrapper

Because having a dozen different scripts each just calling commandline.main would be silly, pkgcore’s own scripts are all symlinks to a single wrapper which imports the right actual script based on the sys.argv[0] it is called with and runs it. The script module needs to define either a commandline_commands dict (for a script with subcommands) or a class called OptionParser and a function called main for this to work.

The script used in the source tree also takes care of inserting the right pkgcore package on sys.path. Installed pkgcore uses a different wrapper without this magic.

If you write a new script that should go into pkgcore itself, use the wrapper. If you maintain it externally and do not have a lot of scripts, don’t bother duplicating this wrapper system. Don’t bother duplicating the path manipulation either: if you put your script in the same directory your actual package or module is in (no separate “bin” directory) and don’t run it as root, no path manipulation is required.

Tests

Because additions to the default options pkgcore uses can make your script unrunnable, it is critical to have at least rudimentary tests that just instantiate your parser. Because optparse defaults to calling sys.exit for a parse failure and the pkgcore version also likes to load the user’s configuration files, writing those tests is slightly tricky. pkgcore.test.scripts.helpers tries to make it easier. It contains a mangle_parser function that takes an OptionParser instance and makes it raise exceptions instead of exiting. It also contains a mixin with some extra assert methods that check if your option parser and main function have the desired effect on various arguments and configurations. See the docstrings for more information.
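The idea behind making a parser raise instead of exit can be sketched with plain optparse. This is not the code in pkgcore.test.scripts.helpers (check its docstrings for the real interface); ParseError and make_parser_raise are hypothetical names used only here.

```python
import optparse


class ParseError(Exception):
    """Raised instead of exiting when parsing fails (hypothetical name)."""


def make_parser_raise(parser):
    """Replace error/exit on one parser instance so failures raise."""

    def error(message):
        raise ParseError(message)

    def exit(status=0, message=None):
        raise ParseError(message or "exit status %s" % (status,))

    # Instance attributes shadow the methods optparse would normally call.
    parser.error = error
    parser.exit = exit
    return parser


if __name__ == "__main__":
    parser = make_parser_raise(optparse.OptionParser())
    parser.add_option("--count", type="int")
    try:
        parser.parse_args(["--count", "many"])
    except ParseError as e:
        print("parser complained: %s" % (e,))
```

A test can then assert on the exception instead of having to trap SystemExit and capture stderr.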
3.1.4 Config use and implementation notes

Using the manager

Normal use

To get at the user’s configuration:

    from pkgcore.config import load_config
    config = load_config()
    main_repo = config.get_default('repo')
    spork_repo = config.repo['spork']

Usually this is everything you need to know about the manager. Some things to be aware of:

• Some of the managed sources of configuration data may be slow, so accessing a source is delayed for as long as possible. Some things require accessing all sources though and should therefore be avoided. The easiest one to trigger is config.repo.keys() or the equivalent list(config.sections('repo')). This has to get the “class” setting for every available config section, which might be slow.
• For the same reason the manager does not know what type names exist (there is no hardcoded list of them, so the only way to get that information would be loading all config sections). This is why you can get this:

    >>> load_config().section('repo')  # typo, should be "sections"
    Traceback (most recent call last):
      File "<stdin>", line 1, in ?
    TypeError: '_ConfigMapping' object is not callable

  This constructed a dictlike object for accessing all config sections of the type “section”, then tried to call it.

Testcase use

For testing of high-level scripts it can be useful to construct a config manager containing hardcoded values:

    from pkgcore.config import basics, central
    config = central.ConfigManager([{
        'repo': basics.HardCodedConfigSection({'class': my_repo, 'data': ['1', '2']}),
        'cont': basics.ConfigSectionFromStringDict({'class': 'pkgcore.my.cont', 'ref': 'repo'}),
    }])

What this does should be fairly obvious. Be careful you do not use the same ConfigSection object in more than one place: caching will not behave the way you want. See Adding a config source for details.
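The TypeError shown under Normal use follows from treating every attribute name as a potential type name. A toy model of that behaviour, using hypothetical ToyManager and ToySectionMapping classes rather than pkgcore's real ones, and a plain "type" key in place of inspecting each section's "class":

```python
class ToySectionMapping(object):
    """Dict-like view of all sections of one type (hypothetical stand-in)."""

    def __init__(self, sections, typename):
        self._sections = sections
        self._typename = typename

    def keys(self):
        # Only now is every section inspected; this is the slow operation.
        return sorted(name for name, section in self._sections.items()
                      if section["type"] == self._typename)

    def __getitem__(self, name):
        section = self._sections[name]
        if section["type"] != self._typename:
            raise KeyError(name)
        return section


class ToyManager(object):
    def __init__(self, sections):
        self._sections = sections

    def __getattr__(self, typename):
        # There is no list of valid type names, so any name is accepted
        # and nothing is loaded yet.
        return ToySectionMapping(self._sections, typename)


if __name__ == "__main__":
    manager = ToyManager({"spork": {"type": "repo"}})
    print(manager.repo.keys())
    try:
        manager.section("repo")  # same typo as in the traceback above
    except TypeError as e:
        print("TypeError: %s" % (e,))
```

The typo is only caught when the mapping is called, because creating the mapping for the bogus type name "section" is deliberately cheap and unchecked.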
Adding a configurable

You often do not really have to do anything to make something a valid “class” value, but it is clearer and it is necessary in certain cases.

Adding a class

To make a class available, do this:

    from pkgcore.config import ConfigHint, errors

    class MyRepo(object):

        pkgcore_config_type = ConfigHint({'cache': 'section_ref'}, typename='repo')

        def __init__(self, cache):
            try:
                self.initialize(cache)
            except SomeRandomException:
                raise errors.InstantiationError('eep!')

The first ConfigHint arg tells the config system what kind of arguments you take. Without it, the system assumes arguments with no default are strings and guesses the type of the other args from the type of their default value. So if you have no default values, or they are just None, you should tell the system about your args.

The second one tells it you fulfill the repo “protocol”, meaning your instances will show up in load_config().repo.

ConfigHint takes some more arguments, see the api docs for details.

Adding a callable

To make a callable available you can do this:

    from pkgcore.config import configurable, errors

    @configurable({'cache': 'section_ref'}, typename='repo')
    def my_repo(cache):
        # do stuff
        pass

configurable is just a convenience function that applies a ConfigHint.

Exception handling

If you raise an exception when the config system calls you, it will catch the exception and wrap it in an InstantiationError. This is good for calling code since catching and printing those provides the user with a readable description of what happened. It is less good for developers since the raising of a new exception kills the traceback printed in debug mode. You will have a traceback that “ends” in the config code handling instantiation.

You can improve this by raising an InstantiationError yourself.
If you do this the config system will be able to add the extra information needed for a user-friendly error message to it without raising a new exception, meaning debug mode will give a traceback leading right back to your code raising the InstantiationError.

Adding a config source

Config sources are pretty straightforward: they are mappings from a section name to a ConfigSection subclass. The only tricky thing is the combination of section references and caching. The general rule is “do not expose the same ConfigSection in more than one way”. If you do, it will be collapsed and instantiated once for every way it is exposed, which is usually not what you want. An example:

    from pkgcore.config import basics, configurable

    def example():
        return object()

    @configurable({'ref': 'section_ref'})
    def nested(ref):
        return ref

    multi = basics.HardCodedConfigSection({'class': example})
    myconf = {
        'multi': multi,
        'bad': basics.HardCodedConfigSection({'class': nested, 'ref': multi}),
        'good': basics.ConfigSectionFromStringDict({'class': 'nested', 'ref': 'multi'}),
    }

If you feed this to the ConfigManager and instantiate everything, “multi” and “good” will be identical but “bad” will be a different object. For an explanation of why this happens see the implementation notes in the next section.

You trigger a similar problem if you create a custom ConfigSection subclass that bypasses central’s collapse_named_section for named section refs. If you somehow get at the referenced ConfigSection and hand it to collapse_section you will most likely circumvent caching. Only use collapse_section for unnamed sections.

ConfigManager tries not to extract more things from this mapping than it has to. Specifically, it will not call __getitem__ before it needs to instantiate the section or needs to know its type. However it will iterate over the keys (section names) immediately to find autoloads. If this is a problem (getting those names is slow) then make sure the manager knows your config is “remote”.
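Why “bad” ends up with a different object can be seen in a toy model that caches by section name only. ToySection and ToyCentral are hypothetical stand-ins, not pkgcore classes, and the sketch folds collapsing and instantiating into a single step for brevity.

```python
class ToySection(object):
    """Stand-in for a ConfigSection: a factory plus an optional reference."""

    def __init__(self, factory, ref=None):
        self.factory = factory
        self.ref = ref  # a section name (str) or another ToySection


class ToyCentral(object):
    """Stand-in for the manager; results are cached by section name only."""

    def __init__(self, sections):
        self.sections = sections
        self._cache = {}

    def instantiate_named(self, name):
        # Cached path: each named section is built at most once.
        if name not in self._cache:
            self._cache[name] = self.instantiate(self.sections[name])
        return self._cache[name]

    def instantiate(self, section):
        # Uncached path: a section object handed in directly is rebuilt.
        if section.ref is None:
            return section.factory()
        if isinstance(section.ref, str):
            return section.factory(self.instantiate_named(section.ref))
        return section.factory(self.instantiate(section.ref))


def example():
    return object()


def nested(ref):
    return ref


multi = ToySection(example)
central = ToyCentral({
    "multi": multi,
    "bad": ToySection(nested, ref=multi),     # same section, exposed directly
    "good": ToySection(nested, ref="multi"),  # reference by name
})

if __name__ == "__main__":
    m = central.instantiate_named("multi")
    print(central.instantiate_named("good") is m)
    print(central.instantiate_named("bad") is m)
```

The by-name reference goes through the cache and gets the existing object, while the direct reference has no name to look up and so builds a second one.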
Implementation notes

This code has evolved quite a bit over time. The current code/design tries among other things to:

• Allow sections to contain both named and nameless/inline references to other sections.
• Allow serialization of the loaded config.
• Not do unnecessary work (if possible, do not recollapse configs; definitely do not trigger unnecessary imports, access configs unnecessarily, or reinstantiate configs).
• Provide both end-user error messages that are complete enough to track down a problem in a complex nested config and tracebacks that reach back to actual buggy code for developers.

Overview from load_config() to instantiated repo

When you call load_config() it looks up what config files are available (/etc/pkgcore.conf, ~/.pkgcore.conf, /etc/make.conf) and loads them. This produces a dict mapping section names to ConfigSection instances. For the ini-format pkgcore.conf files this is straightforward; for make.conf this is a lot of work done in pkgcore.config.portage_conf. I’m not going to describe that module here, read the source for details.

The ConfigSections have a pretty straightforward api: they work like dicts but get passed a string describing what “type” the value should be and a central.ConfigManager instance for reasons described later. Passing in this “type” string when getting the value is necessary because the way things like lists of strings are stored depends on the format of the configuration file, but the parser does not have enough information to know it should parse as a list instead of a string.
For example, an ini-format pkgcore.conf could contain:

    [my-overlay-cache]
    class=pkgcore.cache.flat_hash.database
    auxdbkeys=DEPEND RDEPEND

We want to turn that auxdbkeys value into a list of strings in the ini file parser code instead of in the central.ConfigManager or even higher up, because more exotic config sections may want to store this in a different way (perhaps as a comma-separated list, or even as “<el>DEPEND</el><el>RDEPEND</el>”). But there is obviously not enough information in the ini file for the parser to know this is meant as a list instead of a string with a space in it.

central.ConfigManager gets instantiated with one or more of those dicts mapping section names to ConfigSections. They’re split up into normal and “remote” configs which I’ll describe later; let’s assume they’re all “remote” for now. In that case no work is done when the ConfigManager is instantiated.

Getting an actual configured object out of the ConfigManager is split in two phases. First the involved config sections are “collapsed”: inherits are processed, values are converted to the right type, presence of required arguments is checked, etc. This covers everything up to, but not including, instantiating the target class and any section references it needs. The result of this work is bundled in a CollapsedConfig instance. Actual instantiation is handled by the CollapsedConfig instance.

The ConfigManager manages CollapsedConfig instances. It creates new ones if required and makes sure that if a cached instance is available it is used.

For the remainder of the example let’s assume our config looks like this:

    [spork]
    inherit=cache
    auxdbkeys=DEPEND RDEPEND

    [cache]
    class=pkgcore.cache.flat_hash.database

Running config.repo['spork'] runs config.collapse_named_section('spork'). This first checks if this section was already collapsed and returns the CollapsedConfig if it is available.
If it is not in the cache it looks up the ConfigSection with that name in the dicts handed to the ConfigManager on instantiation and calls collapse_section on it.

collapse_section first recursively finds any inherited sections (just the “cache” section in this case). It then grabs the ‘class’ setting (which is always of type ‘callable’). In this case that’s “pkgcore.cache.flat_hash.database”, which the ConfigSection imports and returns. This is then wrapped in a config.basics.ConfigType. A ConfigType contains the information necessary to validate arguments passed to the callable. It uses the magic pkgcore_config_type attribute if the callable has it and introspection for everything else. In this case pkgcore.cache.flat_hash.database.pkgcore_config_type is a ConfigHint stating the “auxdbkeys” argument is of type “list”.

Now that collapse_section has a ConfigType it uses it to retrieve the arguments from the ConfigSections and passes the ConfigType and arguments to CollapsedConfig’s __init__. Then it returns the CollapsedConfig instance to collapse_named_section. collapse_named_section caches it and returns it.

Now we’re back in the __getattr__ triggered by config.repo['spork']. This checks if the ConfigType on the CollapsedConfig is actually ‘repo’, and returns collapsedConfig.instantiate() if this matches.

Lazy section references

The main reason the above is so complicated is to support various kinds of references to other sections. Example:

    [spork]
    class=pkgcore.Spork
    ref=foon

    [foon]
    class=pkgcore.Foon

Let’s say pkgcore.Spork has a ConfigHint stating the type of the “ref” argument is “lazy_ref:foon” (lazy reference to a foon) and its typename is “repo”, and pkgcore.Foon has a ConfigHint stating its typename is “foon”. A “lazy reference” is an instance of basics.LazySectionRef, which is an object containing just enough information to produce a CollapsedConfig instance.
This is not the most common kind of reference, but it is simpler from the config point of view so I’m describing this one first.

When collapse_section runs on the “spork” section it calls section.get_value(self, 'ref:repo', 'section_ref'). “lazy_ref” in the type hint is converted to just “ref” here because the ConfigSections do not have to distinguish between lazy and “normal” references. Because this particular ConfigSection only supports named references it returns a LazyNamedSectionRef(central, 'ref:repo', 'foon'). This just gets handed to Spork’s __init__. If the Spork decides to call instantiate() on the LazyNamedSectionRef it calls central.collapse_named_section('foon'), checks if the result is of type foon, instantiates it and returns it.

The same thing using a dhcp-style config:

    spork {
        class pkgcore.Spork;
        ref {
            class pkgcore.Foon;
        };
    }

In this format the reference is an inline unnamed section. When get_value(central, 'ref:repo', 'foon') is called it returns a LazyUnnamedSectionRef(central, 'ref:repo', section) where section is a ConfigSection instance for the nested section (knowing just that “class” is “pkgcore.Foon” in this case). This is handed to Spork.__init__. If Spork calls instantiate() on it, it calls central.collapse_section(self.section) and does the same type checking and instantiating LazyNamedSectionRef did.

Notice neither Spork nor ConfigManager care if the reference is inline or named. get_value just has to return a LazySectionRef instance (LazyUnnamedSectionRef and LazyNamedSectionRef are subclasses of this). How this actually gets a referenced config section is up to the ConfigSection whose get_value gets called.

Normal section references

If Spork’s ConfigHint defines the type of its “ref” argument as “ref:foon” instead of “lazy_ref:foon” it gets handed an actual Foon instance instead of a LazySectionRef to one. This is built on top of the lazy reference code. For the ConfigSections nothing changes (the same get_value call is made).
But the ConfigManager now immediately calls collapse() on the LazySectionRef, retrieving a CollapsedConfig instance (for the “foon”). This is handed to the CollapsedConfig for “spork”, and when this one is instantiated the referenced CollapsedConfig is also instantiated.

Miscellaneous details

The support for nameless sections means neither ConfigSection nor CollapsedConfig have a name attribute. This makes the error handling code a bit tricky as it has to tag in the name at various points, but this works better than enforcing names where it does not make sense (means lots of unnecessary duplication of names when dealing with dicts of HardCoded/StringBasedConfigSections).

The support for serialization of the loaded config means section_refs cannot be instantiated straight away. The object used for serialization is the CollapsedConfig, which gives you both the actual values and the type they have. If the CollapsedConfig contained arbitrary instantiated objects, serializing them would be impossible. So it contains nested CollapsedConfigs instead.

Not doing unnecessary work is done by caching in two places. The simple one is CollapsedConfig caching its instantiated value. This is pretty straightforward. The more subtle one is ConfigManager caching CollapsedConfigs by name. It is obviously a good idea to cache these (if we didn’t we would have to cache the instantiated value in the ConfigManager). An alternative would be caching them by their ConfigSection. This has the minor disadvantage of keeping the ConfigSection in memory, and the larger one that it may break caching for weird config sources that generate ConfigSections on demand. The downside of caching by name is we have to make sure nothing generates a CollapsedConfig for a named section in a way other than collapse_named_section (handing the ConfigSection to collapse_section bypasses caching).
This means a ConfigSection cannot return a raw ConfigSection from a section_ref get_value call. If it did, central would collapse that ConfigSection itself, and if the reference was actually to a named section the cache would be bypassed.

The need for a section name starting with “autoload” is also there to avoid unnecessary work. Without this we would have to figure out the typename of every section. While we can do that without entirely collapsing the config, we cannot avoid importing the “class”, which means load_config() would import most of pkgcore. That should definitely be avoided.

3.1.5 Checking the source out

If you’re just installing pkgcore from a released tarball, skip this section.

To get the current (development) code with history, install git (emerge git on gentoo) and run:

    git clone git://pkgcore.org/pkgcore

3.1.6 Installing pkgcore

Set PYTHONPATH

If you only want to run scripts from pkgcore itself (the ones in its “bin” directory) you do not have to do anything with PYTHONPATH. If you want to use pkgcore from an interactive python interpreter session you do not have to do anything if you start the interpreter from the “root” of the pkgcore source tree. For other uses you probably want to set PYTHONPATH to include your pkgcore directory, so that python can find the pkgcore code. For example:

    $ export PYTHONPATH="${PYTHONPATH}:/home/user/pkgcore/"

Now test to see if it works:

    $ python -c 'import pkgcore'

Python will scan the pkgcore checkout, see the pkgcore directory in it (and that it has __init__.py), and use that.

Registering plugins

Pkgcore uses plugins for some basic functionality. You do not really have to do anything to get this working, but things are a bit faster if the plugin cache is up to date.
This happens automatically if the cache is stale and the user running pkgcore may write there, but if pkgcore is installed somewhere system-wide and you only run it as user you can force a regeneration with:

    # pplugincache

If you want to update plugin caches for something other than pkgcore’s core plugin registry, pass the package name as an argument.

Test pkgcore

Drop back to normal user, and try:

    $ python
    >>> import pkgcore.config
    >>> from pkgcore.ebuild.atom import atom
    >>> conf=pkgcore.config.load_config()
    >>> tree=conf.get_default('domain').repos[1]
    >>> pkg=max(tree.itermatch(atom("dev-util/diffball")))
    >>> print pkg
    >>> print pkg.depends
    >=dev-libs/openssl-0.9.6j >=sys-libs/zlib-1.1.4 >=app-arch/bzip2-1.0.2

At the time of writing the domain interface is in flux, so this example might fail for you. If it doesn’t work, ask for assistance in #pkgcore on freenode, or email ferringb (at) gmail.com with the traceback.

Build extensions

If you want to run pkgcore from its source directory but also want the extra speed from the compiled extension modules, compile them in place:

    $ python setup.py build_ext -i

3.1.7 Ebuild EAPI

This should hold the proposed (with a chance of making it in), accepted, and implemented changes for ebuild format version 1. A version 0 doc would also be a good idea (no one has volunteered thus far).

Version 0 (or undefined eapi, <=portage-2.0.52*)

Version 1

This should be fairly easy stuff to implement for the package manager, so this can actually happen in a fairly short timeframe.

• EAPI = 1 required
• src_configure phase is run before src_compile. If the ebuild or eclass does not override it, there is a default that does nothing. Things like econf should be run in this phase, allowing rerunning the build phase without rerunning configure during development.
• Make the default implementation of phases/functions available under a second name (possibly using EXPORT_FUNCTIONS) so you can call base_src_compile from your src_compile.
• default src_install. Exactly what goes in needs to be figured out, see bug 33544.
• RDEPEND="${RDEPEND-${DEPEND}}" is no longer set by portage, same for eclass.
• (proposed) BDEPEND metadata addition, maybe. These are the dependencies that are run on the build system (toolchain, autotools etc). Useful for ROOT != "/". Probably hard to get right for ebuild devs who always have ROOT="/".
• default IUSE support, IUSE="+gcj" == USE="gcj" unless the user disables it.
• GLEP 37 (“Virtuals Deprecation”), maybe. The glep is “deferred”. How much of this actually needs to be done? package.preferred?
• test depend, test src_uri (or represent test in the use namespace somehow). TEST_{SRC_URI,{B,R,}DEPEND}, test “USE” flag getting set by FEATURES=test.

Possibilities:

• drop AA (unused).
• represent in metadata if the pkg needs pkg_preinst to have access to ${D} or not. If this is not required a binpkg can be unpacked straight to root after pkg_preinst. If pkg_preinst needs access to ${D} the binpkg is unpacked there as usual.
• use groups in some form (kill use_expand off).
• ebuilds can no longer use PORTDIR and ECLASSDIR(s); they break any potential remote, and are dodgy as all hell for multiple repos combined together.
• disallow direct access to /var/db/pkg
• deprecate ebuild access/awareness of PORTAGE_* vars; perl ebuilds security fix for PORTAGE_TMPDIR (rpath stripping in a way) might make this harder.
• use/slot deps, optionally repository deps.
• hard one to slide in, but change versioning rules; no longer allow 1.006, require it to be 1.6
• pkg_setup must be sandboxable.
• allowed USE conditional configurations; new metadata key, extend depset syntax to include xor, represent allowed configurations.
• true incremental stacking support for metadata keys between eclasses/ebuilds; RESTRICT=-strip for example in the ebuild.
• drop -* from keywords; it’s package.masking, use that instead (-arch is acceptable although daft)
• blockers aren’t allowed in PDEPEND (the result of that is serious insanity for resolving)

Version 1+

Not sure about these. Maybe some can go into version 1, maybe they will happen later.

• Elibs
• some way to ‘bind’ a rdep/pdep so that it’s explicit “I’m locked against the version I was compiled against”
• some form of optional metadata specifying that a binpkg works on multiple arches, iow it doesn’t rely on compiled components.
• A way to move svn/cvs/etc source fetching over to the package manager. The current way of doing this through an eclass is a bit ugly since it requires write access to the distdir. Moving it to the package manager fixes that and allows integrating it with things like parallel fetch. This needs to be fleshed out.

3.1.8 Feature (FEATURES) categories

relevant list of features

• autoaddcvs
• buildpkg
• ccache
• collision-protect
• confcache
• cvs
• digest
• distcc
• distlocks
• fixpackages
• getbinpkg
• gpg
• keeptemp
• keepwork
• mirror
• noclean (keeptemp, keepwork)
• nodoc
• noinfo
• noman
• nostrip
• notitles
• sandbox
• severe
• severer (dumb spanky)
• sfperms
• sign
• strict
• suidctl
• test
• userpriv
• userpriv_fakeroot
• usersandbox

Undefined

• fixpackages

Dead

• usersandbox
• noclean
• getbinpkg (it’s a repo type, not a global feature)
• buildpkg (again, repo thing.
moreso ui/buildplan execution)

Build

• keeptemp, keepwork, noclean, ccache, distcc
• sandbox, userpriv, fakeroot
• userpriv_fakeroot becomes fakeroot
• confcache
• noauto (fun one)
• test

repos or wrappers

Mutables

• autoaddcvs
• cvs
• digest
• gpg
• no{doc,info,man,strip}
• sign
• sfperms
• collision-protect (vdb only)

Immutables

• strict
• severe ; these two are repository opts on gpg repo class

Fetchers

• distlocks, sort of.

3.1.9 Filesystem Operations

Here we define types of operations that pkgcore will support, as well as the stages where these operations occur.

- File Deletion (Removal)
  • prerm
  • unmerge files
  • postrm
- File Addition (Installation)
  • preinst
  • merge files
  • postinst
- File Replacement (Overwriting)
  • preinst
  • merge
  • postinst
  • prerm
  • unmerge
  • postrm

3.1.10 Python Code Guidelines

Note that not all of the existing code follows this style guide. This doesn’t mean existing code is correct.

Stats are all from a sempron 1.6Ghz with python 2.4.2.

Finally, code _should_ be documented, following epydoc/epytext guidelines.

Follow pep8, with following exemptions

• <80 char limit is only applicable where it doesn’t make the logic ugly. This is not an excuse to have a 200 char if statement (fix your logic). Use common sense.
• Combining imports is ok.
• Use absolute imports
• _Simple_ try/except combined lines are acceptable, but not forced (this is your call). Example (list.remove raises ValueError when the item is missing):

      try: l.remove(blah)
      except ValueError: pass

• For comments, 2 trailing spaces are pointless and not needed.
• Classes should be named SomeClass, functions/methods should be named some_func.
• Exceptions are classes. Don’t raise strings.
• Avoid __var ‘private’ attributes unless you absolutely have a reason to hide it, and the class won’t be inherited (or that attribute must _not_ be accessed)
• Using string module functions when you could use a string method is evil. Don’t do it.
• Use isinstance(str_instance, basestring) unless you _really_ need to know if it’s utf8/ascii

Throw self with a NotImplementedError

The reason for this is simple: if you just throw a NotImplementedError, you can’t tell how the path was hit if derivative classes are involved; thus throw NotImplementedError(self, string_name_of_attr)

This gives far better tracebacks.

Be aware of what the interpreter is actually doing

Don’t use len(list_instance) when you just want to know if it’s nonempty/empty:

    l=[1]
    if l: blah
    # instead of
    if len(l): blah

python looks for __nonzero__, then __len__. It’s far faster than if you try to be explicit there:

    python -m timeit -s 'l=[]' 'if len(l) > 0: pass'
    1000000 loops, best of 3: 0.705 usec per loop
    python -m timeit -s 'l=[]' 'if len(l): pass'
    1000000 loops, best of 3: 0.689 usec per loop
    python -m timeit -s 'l=[]' 'if l: pass'
    1000000 loops, best of 3: 0.302 usec per loop

Don’t explicitly use has_key. Rely on the ‘in’ operator

    python -m 'timeit' -s 'd=dict(zip(range(1000), range(1000)))' 'd.has_key(1999999)'
    1000000 loops, best of 3: 0.512 usec per loop
    python -m 'timeit' -s 'd=dict(zip(range(1000), range(1000)))' '1999999 in d'
    1000000 loops, best of 3: 0.279 usec per loop

Python interprets the ‘in’ operator by using __contains__ on the instance. The interpreter is faster at doing getattr’s than actual python code is: for example, the ‘in’ test above uses d.__contains__; if you call d.has_key or d.__contains__ yourself, it’s the same speed. Using ‘in’ is faster because it has the interpreter do the lookup.

So be aware of how the interpreter will execute that code. Attribute access spelled out in python code is slower than the interpreter doing it on its own.
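The comparison above can also be scripted with the timeit module instead of the command line. This is a minimal sketch, assuming a current Python 3 interpreter (where has_key no longer exists), so it compares the ‘in’ operator against an explicit d.__contains__ call; the variable names are ours and the absolute numbers will differ from the 2.4.2 stats quoted in this document.

```python
import timeit

# Same dictionary as in the command-line runs above.
setup = "d = dict(zip(range(1000), range(1000)))"

# The interpreter dispatches the containment test itself.
op_time = timeit.timeit("1999999 in d", setup=setup, number=100000)

# The method is looked up from python code, then called.
call_time = timeit.timeit("d.__contains__(1999999)", setup=setup, number=100000)

print("in operator:  %.4f s" % op_time)
print("__contains__: %.4f s" % call_time)
```

Both statements are timed over the same number of iterations, so the two totals can be compared directly.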
If you’re in doubt, python -m timeit is your friend. ;-)

Do not use [] or {} as default args in function/method definitions

    >>> def f(default=[]):
    ...     default.append(1)
    ...     return default
    >>> print f()
    [1]
    >>> print f()
    [1, 1]

When the function/class/method is defined, the default args are instantiated _then_, not per call. The end result of this is that if it’s a mutable default arg, you should use None and test for it being None; this is exempted if you _know_ the code doesn’t mangle the default.

Visible curried functions should have documentation

When using the currying methods (pkgcore.util.currying) for function mangling, preserve the documentation via pretty_docs.

If this is skipped, pydoc output for objects isn’t incredibly useful.

Unit testing

All code _should_ have test case functionality. We use twisted.trial; you should be running >=2.2 (<2.2 results in false positives in the spawn tests). Regressions should be test cased, exempting idiot mistakes (e.g., typos).

We are more than willing to look at code that lacks tests, but actually merging the code to integration requires that it has tests.

One area that is (at the moment) exempted from this is the ebuild interaction; testing that interface is extremely hard, although it _does_ need to be implemented.

If tests are missing from code (due to tests not being written initially), new tests are always desired.
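As a rough sketch of two points from this part of the guidelines (None as the default for a mutable argument, and regressions being test cased), the following uses the standard library’s unittest as a stand-in for twisted.trial. The function and test names are hypothetical and are not taken from the pkgcore tree.

```python
import unittest


def append_one(default=None):
    """Append 1 to the given list, or to a fresh list if none is given."""
    # A new list is created per call, so calls never share state.
    if default is None:
        default = []
    default.append(1)
    return default


class AppendOneTest(unittest.TestCase):

    def test_default_is_not_shared(self):
        # With a [] default the second call would return [1, 1].
        self.assertEqual(append_one(), [1])
        self.assertEqual(append_one(), [1])

    def test_caller_list_is_extended(self):
        data = [5]
        self.assertIs(append_one(data), data)
        self.assertEqual(data, [5, 1])


def run_tests():
    """Run the test case above and return the unittest result object."""
    suite = unittest.defaultTestLoader.loadTestsFromTestCase(AppendOneTest)
    return unittest.TextTestRunner(verbosity=0).run(suite)
```

Calling run_tests().wasSuccessful() reports whether both tests passed.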
If it’s FS related code, it’s _usually_ cheaper to try than to ask then try

...but you should verify it ;)

existing file (but empty to avoid reading overhead):

    echo > dar
    python -m 'timeit' -s 'import os' 'os.path.exists("dar") and open("dar").read()'
    10000 loops, best of 3: 36.4 usec per loop
    python -m 'timeit' -s 'import os' $'try:open("dar").read()\nexcept IOError: pass'
    10000 loops, best of 3: 22 usec per loop

nonexistent file:

    rm foo
    python -m 'timeit' -s 'import os' 'os.path.exists("foo") and open("foo").read()'
    10000 loops, best of 3: 29.8 usec per loop
    python -m 'timeit' -s 'import os' $'try:open("foo").read()\nexcept IOError: pass'
    10000 loops, best of 3: 27.7 usec per loop

As you can see, there is a bit of a difference. :)

Note that this was qualified with “If it’s FS related code”; syscalls are not cheap. If it’s not triggering syscalls, the next section is relevant.

Catching Exceptions in python code (rather than cpython) isn’t cheap

stats from python-2.4.2

When an exception is caught:

    python -m 'timeit' -s 'd=dict(zip(range(1000), range(1000)))' $'try: d[1999]\nexcept KeyError: pass'
    100000 loops, best of 3: 8.7 usec per loop
    python -m 'timeit' -s 'd=dict(zip(range(1000), range(1000)))' $'1999 in d and d[1999]'
    1000000 loops, best of 3: 0.492 usec per loop

When no exception is caught, overhead of try/except setup:

    python -m 'timeit' -s 'd=dict(zip(range(1000), range(1000)))' $'try: d[0]\nexcept KeyError: pass'
    1000000 loops, best of 3: 0.532 usec per loop
    python -m 'timeit' -s 'd=dict(zip(range(1000), range(1000)))' $'d[0]'
    1000000 loops, best of 3: 0.407 usec per loop

This doesn’t advocate writing code that doesn’t protect itself. Just be aware of what the code is actually doing, and be aware that exceptions in python code are costly due to the machinery involved.
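The “try instead of ask then try” advice for filesystem code could look like the helper below. This is an illustrative sketch only: read_or_none is a hypothetical name, not a pkgcore function, and it assumes a missing file is the only error the caller wants to tolerate.

```python
import errno


def read_or_none(path):
    """Return the contents of path, or None if the file does not exist."""
    try:
        # No os.path.exists() check first: that would cost an extra syscall.
        with open(path) as handle:
            return handle.read()
    except IOError as exc:
        # Only a missing file is expected here; anything else is re-raised.
        if exc.errno != errno.ENOENT:
            raise
        return None
```

Checking errno keeps the handler narrow, so a permissions problem still surfaces as an exception instead of being mistaken for a missing file.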
Another example is when to use or not to use dict’s setdefault or get methods:

key exists:

    # Through exception handling
    python -m timeit -s 'd=dict.fromkeys(range(100))' 'try: x=d[1]' 'except KeyError: x=42'
    1000000 loops, best of 3: 0.548 usec per loop
    # d.get
    python -m timeit -s 'd=dict.fromkeys(range(100))' 'x=d.get(1, 42)'
    1000000 loops, best of 3: 1.01 usec per loop

key doesn’t exist:

    # Through exception handling
    python -m timeit -s 'd=dict.fromkeys(range(100))' 'try: x=d[101]' 'except KeyError: x=42'
    100000 loops, best of 3: 8.8 usec per loop
    # d.get
    python -m timeit -s 'd=dict.fromkeys(range(100))' 'x=d.get(101, 42)'
    1000000 loops, best of 3: 1.05 usec per loop

The short version of this is: if you know the key is there, dict.get() is slower. If you don’t, get is your friend. In other words, use it instead of doing a containment test and then accessing the key.

Of course this only considers the case where the default value is simple. If it’s something more costly, “except” will do relatively better since it’s not constructing the default value if it’s not needed.

So if in doubt and in a performance-critical piece of code: benchmark parts of it with timeit instead of assuming “exceptions are slow” or “[] is fast”.

cpython ‘leaks’ vars into local namespace for certain constructs

    def f(s):
        while True:
            try:
                some_func_that_throws_exception()
            except Exception, e:
                # e exists in this namespace now.
                pass
            # some other code here...

From the code above, e bled into the f namespace. That’s referenced memory that isn’t used, and will linger until the while loop exits.
Python _does_ bleed variables into the local namespace. Be aware of this, and explicitly delete references you don’t need when dealing in large objs, especially dealing with exceptions:

    class c:
        d = {}
        for x in range(1000):
            d[x] = x

While the class above is contrived, the thing to note is that c.x is now valid: the x from the for loop bleeds into the class namespace and stays put.

Don’t leave unneeded vars lingering in class namespace.

Variables that leak from for loops _normally_ aren’t an issue, just be aware it does occur, especially if the var is referencing a large object (thus keeping it in memory).

So... for loops leak, list comps leak, and dependent on your except clause they can also leak.

Do not go overboard with this though. If your function will exit soon do not bother cleaning up variables by hand. If the “leaking” things are small do not bother either.

The current code deletes exception instances explicitly much more often than it should since this was believed to clean up the traceback object. This does not work: the only thing “del e” frees up is the exception instance and the arguments passed to its constructor. “del e” also takes a small amount of time to run (clearing up all locals when the function exits is faster).

Unless you need to generate (and save) a range result, use xrange

    python -m timeit 'for x in range(10000): pass'
    100 loops, best of 3: 2.01 msec per loop
    python -m timeit 'for x in xrange(10000): pass'
    1000 loops, best of 3: 1.69 msec per loop

Removals from a list aren’t cheap, especially left most

If you _do_ need to do left most removals, the deque module is your friend.
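A small sketch of the deque suggestion: collections.deque supports removing from the left end in constant time, whereas list.pop(0) or del l[0] has to shift every remaining element. The function name drain_left is hypothetical and only serves the example.

```python
from collections import deque


def drain_left(items):
    """Consume items from the left end and return them in the order seen."""
    queue = deque(items)
    seen = []
    while queue:
        # popleft() is O(1); list.pop(0) would be O(len(list)) per call.
        seen.append(queue.popleft())
    return seen
```

If only iteration is needed, a plain for loop over the list is simpler still; the deque pays off when items are removed from the front while others are still being appended.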
Rightmost removals aren’t too cheap either, depending on what idiocy people come up with to try and ‘help’ the interpreter:

    python -m timeit $'l=range(1000);i=0;\nwhile i < len(l):\n\tif l[i]!="asdf":del l[i]\n\telse:i+=1'
    100 loops, best of 3: 4.12 msec per loop
    python -m timeit $'l=range(1000);\nfor i in xrange(len(l)-1,-1,-1):\n\tif l[i]!="asdf":del l[i]'
    100 loops, best of 3: 3 msec per loop
    python -m timeit 'l=range(1000);l=[x for x in l if x == "asdf"]'
    1000 loops, best of 3: 1 msec per loop

Granted, that’s worst case, but the worst case is usually where people get bitten (note the best case still is faster for list comprehension).

On a related note, don’t pop() unless you have a reason to.

If you’re testing for None specifically, be aware of the ‘is’ operator

‘is’ avoids the equality protocol, and does a straight ptr comparison:

    python -m timeit '10000000 != None'
    1000000 loops, best of 3: 0.721 usec per loop
    python -m timeit '10000000 is not None'
    1000000 loops, best of 3: 0.343 usec per loop

Note that we’re specifically forcing a large int; using 1 under 2.5 is the same runtime. The reason for this is that it defaults to an identity check, then a comparison; for small ints, python uses singletons, thus identity kicks in.

Deprecated/crappy modules

• Don’t use types module. Use isinstance (this isn’t a speed reason, types sucks).
• Don’t use string module. There are exceptions, but use string methods when available.
• Don’t use stat module just to get a stat attribute, e.g.:

      import os
      import stat
      l=os.stat("asdf")[stat.ST_MODE]
      # can be done as (and a bit cleaner)
      l=os.stat("asdf").st_mode

Know the exceptions that are thrown, and catch just those you’re interested in

    try:
        blah
    except Exception:
        blah2

There is a major issue here. On python 2.4 this also catches SystemExit and KeyboardInterrupt (the latter is raised by keyboard interrupts), meaning this code, which was just bad exception handling, now swallows Ctrl+C (meaning it now screws with UI code).
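As a sketch of the alternative to a blanket handler, the function below names the exceptions it expects and lets everything else (including KeyboardInterrupt and SystemExit) propagate. parse_port and its default value are hypothetical, chosen only for illustration.

```python
def parse_port(text, default=8080):
    """Convert text to an int port number, falling back to default."""
    try:
        return int(text)
    except (TypeError, ValueError):
        # int() raises TypeError for None and ValueError for bad strings;
        # nothing else is caught, so Ctrl+C still reaches the caller.
        return default
```

Listing the exception types also documents which failures the author actually thought about.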
Catch what you’re interested in only.

tuples versus lists.

The former is immutable, while the latter is mutable.

Lists over-allocate (a cpython thing), meaning a list takes up more memory than is used (this is actually a good thing usually).

If you’re generating/storing a lot of sequences that shouldn’t be modified, use tuples. They’re cheaper in memory, and people can reference the tuple directly without being concerned about it being mutated elsewhere.

However, using lists there would require each consumer to copy the list to protect themselves from mutation. So... over-allocation + allocating a new list for each consumer.

Bad, mm’kay.

Don’t try to copy immutable instances (e.g. tuples/strings)

Example: copy.copy((1,2,3)) is dumb; nobody makes a mistake that obvious, but in larger code people do (people even try using [:] to copy a string; it returns the same string since it’s immutable).

You can’t modify them, therefore there is no point in trying to make copies of them.

__del__ methods mess with garbage collection

__del__ methods have the annoying side effect of blocking garbage collection when that instance is involved in a cycle. Basically, the interpreter doesn’t know what __del__ is going to reference, so it’s unknowable (general case) how to break the cycle.

So... if you’re using __del__ methods, make sure the instance doesn’t wind up in a cycle (whether by careful data structs, or weakref usage).

A general point: python isn’t slow, your algorithm is

    l = []
    for x in data_generator():
        if x not in l:
            l.append(x)

Each “x not in l” test in that code is _best_ case O(1) (e.g., a generator yielding all 0’s), which makes the loop linear overall. The worst case for the whole loop is O(N^2).

    l=set()
    for x in data_generator():
        if x not in l:
            l.add(x)

Best/Worst per lookup are now constant (this isn’t strictly true due to the potential expansion of the set internally, but that’s ignorable in this case).
Furthermore, the first loop actually invokes the __eq__ protocol for x for each element, which can potentially be quite slow if dealing in complex objs.

The second loop invokes __hash__ once on x instead (technically the set implementation may invoke __eq__ if a collision occurs, but that’s an implementation detail).

Technically, the second loop still is a bit inefficient:

    l=set(data_generator())

is simpler and faster.

Example timings for people who don’t see how _bad_ this can get:

    python -m timeit $'l=[]\nfor x in xrange(1000):\n\tif x not in l:l.append(x)'
    10 loops, best of 3: 74.4 msec per loop
    python -m timeit $'l=set()\nfor x in xrange(1000):\n\tif x not in l:l.add(x)'
    1000 loops, best of 3: 1.24 msec per loop
    python -m timeit 'l=set(xrange(1000))'
    1000 loops, best of 3: 278 usec per loop

The difference here is obvious.

This does _not_ mean that sets are automatically better everywhere, just be aware of what you’re doing: for a single search of a range, the setup overhead is far slower than a linear search. That is the nature of sets: while the implementation may be able to guess the proper size up front, it still has to add each item in; if it cannot guess the size (i.e., no size hint, generator, iterator, etc.), it has to just keep adding items in, expanding the set as needed (which requires linear walks for each expansion). While this may seem obvious, people sometimes do effectively the following:

    python -m timeit -s 'l=range(50)' $'if 1001 in set(l): pass'
    100000 loops, best of 3: 12.2 usec per loop
    python -m timeit -s 'l=range(50)' $'if 1001 in l: pass'
    10000 loops, best of 3: 7.68 usec per loop

What’s up with __hash__ and dicts

A bunch of things (too many things most likely) in the codebase define __hash__. The rule for __hash__ is (quoted from http://docs.python.org/ref/customization.html):

    Should return a 32-bit integer usable as a hash value for dictionary operations.
    The only required property is that objects which compare equal have the same hash value.

Here’s a quick rough explanation for people who do not know how a “dict” works internally:

• Things added to it are dumped in a “bucket” depending on their hash value.
• To check if something is in the dict it first determines the bucket to check (based on hash value), then does equality checks (__cmp__ or __eq__ if there is one, otherwise object identity comparison) for everything in the bucket (if there is anything).

So what does this mean?

• There’s no reason at all to define your own __hash__ unless you also define __eq__ or __cmp__. The behaviour of your object in dicts/sets will not change, it will just be slower (since your own __hash__ is almost certainly slower than the default one).
• If you define __eq__ or __cmp__ and want your object to be usable in a dict you have to define __hash__. If you don’t, the default __hash__ is used, which means your objects act in dicts like only object identity matters until you hit a hash collision and your own __eq__ or __cmp__ kicks in.
• If you do define your own __hash__ it has to produce the same value for objects that compare equal, or you get really weird behaviour in dicts/sets (“thing in dict” returning False because the hash values differ while “thing in dict.keys()” returns True because that does not use the hash value, only equality checks).
• If the hash value changes after the object was put in a dict you get weird behaviour too (“s=set([thing]); thing.change_hash();thing in s” is False, but “thing in list(s)” is True). So if your objects are mutable they can usually provide __eq__/__cmp__ but not __hash__.
• Not having many hash “collisions” (same hash value for objects that compare nonequal) is good, but collisions are not illegal.
Too many of them just slow down dict/set operations (in a worst case scenario of the same hash for every object, dict/set operations become linear searches through the single hash bucket everything ends up in).
• If you use the hash value directly keep in mind that collisions are legal. Do not use comparisons of hash values as a substitute for comparing objects (implementing __eq__ / __cmp__). Probably the only legitimate use of hash() is to determine an object’s hash value based on things used for comparison.

__eq__ and __ne__

From http://docs.python.org/ref/customization.html:

    There are no implied relationships among the comparison operators. The truth of x==y does not imply that x!=y is false. Accordingly, when defining __eq__(), one should also define __ne__() so that the operators will behave as expected.

They really mean that. If you define __eq__ but not __ne__, doing “!=” on instances compares them by identity. This is surprisingly easy to miss, especially since the natural way to write unit tests for classes with custom comparisons goes like this:

    self.assertEqual(YourClass(1), YourClass(1))
    # Repeat for more possible values. Uses == and therefore __eq__,
    # behaves as expected.
    self.assertNotEqual(YourClass(1), YourClass(2))
    # Repeat for more possible values. Uses != and therefore object
    # identity, so they all pass (all different instances)!

So you end up only testing __eq__ on equal values (it can say “identical” for different values without you noticing).

Adding a __ne__ that just does “return not self == other” fixes this.

__eq__/__hash__ and subclassing

If your class has a custom __eq__ and it might be subclassed you have to be very careful about how you “compare” to instances of a subclass.
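The __hash__ and __ne__ rules above can be put together in one small class. This is a sketch under our own assumptions: Version is a hypothetical class, not something from pkgcore, and it treats anything that is not exactly a Version as unequal. On Python 3 “!=” is derived from __eq__ by default, so the explicit __ne__ matters on the Python 2 interpreters these notes target.

```python
class Version(object):
    """Hypothetical value object compared by its (major, minor) pair."""

    def __init__(self, major, minor):
        self.major = major
        self.minor = minor

    def __eq__(self, other):
        # Anything that is not exactly a Version is considered different.
        if self.__class__ is not Version or other.__class__ is not Version:
            return False
        return (self.major, self.minor) == (other.major, other.minor)

    def __ne__(self, other):
        # Keep != consistent with == instead of falling back to identity.
        return not self == other

    def __hash__(self):
        # Built from the same fields __eq__ compares, so equal objects
        # always share a hash value.
        return hash((self.major, self.minor))
```

Because the hash depends on major and minor, those attributes should not be changed while an instance sits in a dict or set.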
Usually you will want to be “different” from those unconditionally:

    def __eq__(self, other):
        if self.__class__ is not YourClass or other.__class__ is not YourClass:
            return False
        # Your actual code goes here

This might seem like overkill, but it is necessary to avoid problems if you are subclassed and the subclass does not have a new __eq__. If you just do an “isinstance(other, self.__class__)” check you will compare equal to instances of a subclass, which is usually not what you want. If you just check for “self.__class__ is other.__class__” then subclasses that add a new attribute without overriding __eq__ will compare equal when they should not (because the new attribute differs).

If you subclass something that has an __eq__ you should most likely override it (you might get away with not doing so if the class does not do the type check demonstrated above). If you add a new attribute don’t forget to override __hash__ too (that is not critical, but you will have unnecessary hash collisions if you forget it).

This is especially important for pkgcore because of pkgcore.util.caching. If an instance of a class with a broken __eq__ is used as argument for the __init__ of a class that uses caching.WeakInstMeta it will cause a cached instance to be used when it should not. Notice the class with the broken __eq__ does not have to be cached itself to trigger this! Getting this wrong can cause fun behaviour like atoms showing up in the list of fetchables because the restrictions they’re in compare equal independent of their “payload”.

Exception subclassing

It is pretty common for an Exception subclass to want to customize the return value of str() on an instance. The easiest way to do that is:

    class MyException(Exception):
        """Describe when it is raised here."""

        def __init__(self, stuff):
            Exception.__init__(self, 'MyException because of %s' % (stuff,))
This is usually easier than defining a custom __str__ (since you do not have to store the value of “stuff” as an attribute) and you should be calling the base class __init__ anyway. (This does not mean you should never store things like “stuff” as attrs: it can be very useful for code catching the exception to have access to it. Use common sense.)

Memory debugging

heapy and dowser are the two currently recommended tools.

To use dowser, insert the following into the code wherever you’d like to check the heap (note that this is blocking):

    import cherrypy
    import dowser
    cherrypy.config.update({'engine.autoreload_on': False})
    try:
        cherrypy.quickstart(dowser.Root())
    except AttributeError:
        cherrypy.root = dowser.Root()
        cherrypy.server.start()

For using heapy, see the heapy documentation in pkgcore/dev-notes.

3.1.11 resolver

Current design doesn’t coalesce: it expects that each atom as it’s passed in specifies the dbs, which is how it does its update/empty-tree trickery.

This isn’t optimal. Need to flag specific atoms/matches as “upgrade if possible” or “empty tree if possible”, etc; via this, we get coalescing behaviour.

Specifically, if the targets are git[subversion] and subversion, we want both upgraded. So when resolving git[subversion] and encountering dev-util/subversion, we should aim for upgrading it per the commandline request.

Additional question: should we apply this coalescing awareness to intermediate atoms along the way resolution wise? Specifically, the cnf/dnf solutions, grabbing those and stating “yeah, collapse to these if possible since they’re likely required”?

3.1.12 resolver redesign

Hate to say it, but should go back to a specific ‘resolve’ method w/ the resolver plan object holding targets. The reason is that we may have to backtrack the whole way.
3.1.13 config/use issues

need to find a way to clone a stack, getting a standalone config stack if possible for the resolver, specifically so it can do resets as needed, track what is involved (use dep forcing) w/out influencing preexisting access to that tree, nor being affected by said usage.

3.1.14 hardlink merge

no comments, just need to get around to it.

3.1.15 How to use guppy/heapy for tracking down memory usage

This is a work in progress. It will grow a bit and it may not be entirely accurate everywhere.

Tutorial of sorts

All this was done on a checkout of [email protected], you should be able to check that out and follow along using something like:

    bzr revert -rrevid:[email protected]

in a pkgcore branch.

Heapy is powerful but has a learning curve. Problems are that the documentation (http://guppype.sourceforge.net/heapy_Use.html among others) is a bit unusual and there are various dynamic importing and other tricks in use that mean things like dir() are less helpful than they are on more “normal” python objects. This document’s main purpose is to show you how to ask heapy various kinds of questions. It may or may not show a few cases where pkgcore uses more memory than it should too.

First, get an x86. Heapy currently does not like 64 bit archs much.

Emerge it:

    emerge guppy

Fire up an interactive python prompt, set stuff up:

    >>> from guppy import hpy
    >>> from pkgcore.config import load_config
    >>> c = load_config()
    >>> hp = hpy()

Just to show how annoying heapy’s internal tricks are:

    >>> dir(hp)
    ['__doc__', '__getattr__', '__init__', '__module__', '__setattr__', '_hiding_tag_', '_import', '_name
    >>> help(hp)
    Help on class _GLUECLAMP_ in module guppy.etc.Glue:

    _GLUECLAMP_ = <guppy.heapy.Use interface at 0x-484b8554>

This object is your “starting point”, but as you can see the underlying machinery is not giving away any useful usage instructions.
Do everything that allocates some memory but is not the problem you are tracking down now. Then do:

    >>> hp.setrelheap()

Everything allocated before this call will not be in the data sets you get later.

Now do your memory-intensive thing:

    >>> l = list(x for x in c.repo["portdir"] if x.data)

Keep an eye on system memory consumption. You want to use up a lot but not all of your system ram for nicer statistics. The python process was eating about 109M res in top when the above stuff finished, which is pretty good (for my 512mb ram box).

    >>> h = hp.heap()

The fun one. This object is basically a snapshot of what’s reachable in ram (minus the stuff excluded through setrelheap earlier) which you can do various fun tricks with. Its str() is a summary:

    >>> h
    Partition of a set of 1449133 objects. Total size = 102766644 bytes.
     Index  Count   %     Size   % Cumulative   % Kind (class / dict of class)
         0 985931  68 46300932  45   46300932  45 str
         1  24681   2 22311624  22   68612556  67 dict of pkgcore.ebuild.ebuild_src.package
         2  49391   3 21311864  21   89924420  88 dict (no owner)
         3 115974   8  3776948   4   93701368  91 tuple
         4 152181  11  3043616   3   96744984  94 long
         5  36009   2  1584396   2   98329380  96 weakref.KeyedRef
         6  11328   1  1540608   1   99869988  97 dict of pkgcore.ebuild.ebuild_src.ThrowAwayNameSpace
         7  24702   2   889272   1  100759260  98 types.MethodType
         8  11424   1   851840   1  101611100  99 list
         9  24681   2   691068   1  102302168 100 pkgcore.ebuild.ebuild_src.package
    <54 more rows. Type e.g. '_.more' to view.>

(You might want to keep an eye on ram usage: heapy made the process grow another dozen mb here. It gets painfully slow if it starts swapping, so if that happens reduce your data set.)

Notice the “Total size” in the top right: about 100M. That’s what we need to compare later numbers with.

So here we can see that (surprise!) we have a ton of strings in memory. We also have various kinds of dicts.
Dicts are treated a bit specially: the “dict of pkgcore.ebuild.ebuild_src.package” simply means “all the dicts that are __dict__ attributes of instances of that class”. “dict (no owner)” are all the dicts that are not used as a __dict__ attribute.

You probably guessed what you can use “index” for:

    >>> h[0]
    Partition of a set of 985931 objects. Total size = 46300932 bytes.
     Index  Count   %     Size   % Cumulative   % Kind (class / dict of class)
         0 985931 100 46300932 100   46300932 100 str

Ok, that looks pretty useless, but it really is not. The “sets” heapy gives you (like “h” and “h[0]”) are a bunch of objects, grouped together by an “equivalence relation”. The default one (with the crazy name “Clodo” for “Class or dict owner”) groups together all objects of the same class and dicts with the same owner. We can also partition the sets by a different equivalence relation. Let’s do a silly example first:

    >>> h.bytype
    Partition of a set of 1449133 objects. Total size = 102766644 bytes.
     Index  Count   %     Size   % Cumulative   % Type
         0 985931  68 46300932  45   46300932  45 str
         1  85556   6 45226592  44   91527524  89 dict
         2 115974   8  3776948   4   95304472  93 tuple
         3 152181  11  3043616   3   98348088  96 long
         4  36009   2  1584396   2   99932484  97 weakref.KeyedRef
         5  24702   2   889272   1  100821756  98 types.MethodType
         6  11424   1   851840   1  101673596  99 list
         7  24681   2   691068   1  102364664 100 pkgcore.ebuild.ebuild_src.package
         8  11328   1   317184   0  102681848 100 pkgcore.ebuild.ebuild_src.ThrowAwayNameSpace
         9    408   0    26112   0  102707960 100 types.CodeType
    <32 more rows. Type e.g. '_.more' to view.>

As you can see this is the same thing as the default view, but with all the dicts lumped together. A more useful one is:

    >>> h.byrcs
    Partition of a set of 1449133 objects. Total size = 102766644 bytes.
     Index  Count   %     Size   % Cumulative   % Referrers by Kind (class / dict of class)
         0 870779  60 43608088  42   43608088  42 dict (no owner)
         1  24681   2 22311624  22   65919712  64 pkgcore.ebuild.ebuild_src.package
         2 221936  15 20575932  20   86495644  84 dict of pkgcore.ebuild.ebuild_src.package
         3 242236  17  8588560   8   95084204  93 tuple
         4      6   0  1966736   2   97050940  94 dict of weakref.WeakValueDictionary
         5  36009   2  1773024   2   98823964  96 dict (no owner), dict of pkgcore.ebuild.ebuild_src.package, weakref.KeyedRef
         6  11328   1  1540608   1  100364572  98 pkgcore.ebuild.ebuild_src.ThrowAwayNameSpace
         7  26483   2   800432   1  101165004  98 list
         8  11328   1   724992   1  101889996  99 dict of pkgcore.ebuild.ebuild_src.ThrowAwayNameSpace
         9      3   0   393444   0  102283440 100 dict of pkgcore.repository.prototype.IterValLazyDict
    <132 more rows. Type e.g. '_.more' to view.>

What this does is:

• For every object, find all its referrers.
• Classify those referrers using the “Clodo” relation you saw earlier.
• Create a set of those classifiers of referrers. That means a set of things like “tuple, dict of someclass”, not of actual referring objects.
• Group together all the objects with the same set of classifiers of referrers.

So now we know that we have a lot of objects referenced only by one or more dicts (still not very useful) and also a lot of them referenced by one “normal” dict, referenced by the dict of (meaning “an attribute of”) ebuild_src.package, and referenced by a WeakRef. Hmm, I wonder what those are. But let’s store this view of the data first, since it took a while to generate (“_” is a feature of the python interpreter, it’s always the last result):

    >>> byrcs = _
    >>> byrcs[5]
    Partition of a set of 36009 objects. Total size = 1773024 bytes.
     Index  Count   %     Size   % Cumulative   % Referrers by Kind (class / dict of class)
         0  36009 100  1773024 100    1773024 100 dict (no owner), dict of pkgcore.ebuild.ebuild_src.package, weakref.KeyedRef

Erm, yes, we knew that already.
If you look in the top right of the table you can see it is still grouping the items by the kind of their referrer, which is not very useful here. To get more information we can change what they are grouped by:

    >>> byrcs[5].byclodo
    Partition of a set of 36009 objects. Total size = 1773024 bytes.
     Index  Count   %     Size   % Cumulative   % Kind (class / dict of class)
         0  36009 100  1773024 100    1773024 100 str
    >>> byrcs[5].bysize
    Partition of a set of 36009 objects. Total size = 1773024 bytes.
     Index  Count   %     Size   % Cumulative   % Individual Size
         0  10190  28   489120  28     489120  28 48
         1   7584  21   394368  22     883488  50 52
         2   7335  20   322740  18    1206228  68 44
         3   3947  11   221032  12    1427260  80 56
         4   3364   9   134560   8    1561820  88 40
         5   1903   5   114180   6    1676000  95 60
         6    877   2    56128   3    1732128  98 64
         7    285   1    19380   1    1751508  99 68
         8    451   1    16236   1    1767744 100 36
         9     57   0     4104   0    1771848 100 72

This took the set of objects with that odd set of referrers and redisplayed them grouped by “clodo”, then by size. So now we know they’re all strings. Most of them are pretty small too. To get some idea of what we’re dealing with we can pull some random examples out:

    >>> byrcs[5].byid
    Set of 36009 <str> objects. Total size = 1773024 bytes.
     Index     Size   %   Cumulative   % Representation (limited)
         0       80 0.0           80 0.0 'media-plugin...re20051219-r1'
         1       76 0.0          156 0.0 'app-emulatio...4.20041102-r1'
         2       76 0.0          232 0.0 'dev-php5/ezc...hemaTiein-1.0'
         3       76 0.0          308 0.0 'games-misc/f...wski-20030120'
         4       76 0.0          384 0.0 'mail-client/...pt-viewer-0.8'
         5       76 0.0          460 0.0 'media-fonts/...-100dpi-1.0.0'
         6       76 0.0          536 0.0 'media-plugin...gdemux-0.10.4'
         7       76 0.0          612 0.0 'media-plugin...3_pre20051219'
         8       76 0.0          688 0.0 'media-plugin...3_pre20051219'
         9       76 0.0          764 0.0 'media-plugin...3_pre20060502'
    >>> byrcs[5].byid[0].theone
    'media-plugins/vdr-streamdev-server-0.3.3_pre20051219-r1'

A pattern emerges!
(Sets with one item have a “theone” attribute with the actual item; all sets have a “nodes” attribute that returns an iterator yielding the items.)

We could have used another heapy trick to get a better idea of what kind of string this was:

    >>> byrcs[5].byvia
    Partition of a set of 36009 objects. Total size = 1773024 bytes.
     Index  Count   %     Size   % Cumulative   % Referred Via:
         0      1   0       80   0         80   0 "['cpvstr']", '.key', '.keys()[23147]'
         1      1   0       76   0        156   0 "['cpvstr']", '.key', '.keys()[12285]'
         2      1   0       76   0        232   0 "['cpvstr']", '.key', '.keys()[12286]'
         3      1   0       76   0        308   0 "['cpvstr']", '.key', '.keys()[16327]'
         4      1   0       76   0        384   0 "['cpvstr']", '.key', '.keys()[17754]'
         5      1   0       76   0        460   0 "['cpvstr']", '.key', '.keys()[19079]'
         6      1   0       76   0        536   0 "['cpvstr']", '.key', '.keys()[21704]'
         7      1   0       76   0        612   0 "['cpvstr']", '.key', '.keys()[23473]'
         8      1   0       76   0        688   0 "['cpvstr']", '.key', '.keys()[24239]'
         9      1   0       76   0        764   0 "['cpvstr']", '.key', '.keys()[3070]'
    <35999 more rows. Type e.g. '_.more' to view.>

Ouch, 36009 total rows for 36009 objects. What this did is similar to what “byrcs” did: for every object in the set it determined how it can be reached through its referrers, then grouped together the objects that can be reached in the same ways. Unfortunately it treats every dictionary key position as a different way of reaching an object, so each object ends up in its own group and this is not very useful.

XXX WTF XXX

It is not likely this accomplishes anything, but let’s assume we want to know if there are any objects in this set not reachable as the “key” attribute. Heapy can tell us (although this is very slow... there might be a better way but I do not know it yet):

    >>> nonkeys = byrcs[5] & hp.Via('.key').alt('<')
    >>> nonkeys.byrcs
    hp.Nothing

(Remember “hp” was our main entrance into heapy, the object that gave us the set of all objects we’re interested in earlier.)

What does this do?
“hp.Via('.key')” creates a “symbolic set” of “all objects reachable only as the ‘key’ attribute of something” (it’s a “symbolic set” because there are no actual objects in it). The “alt” method gives us a new symbolic set of everything reachable via “less than” this way. We then intersect this with our set and discover there is nothing left.

A similar construct that does not do what we want is:

    >>> nonkeys = byrcs[5] & ~hp.Via('.key')

The “~” operator inverts the symbolic set, giving a set matching everything not reachable exactly as a “key” attribute. The key word here is “exactly”: since everything in our set was also reachable in two other ways this intersection matches everything.

Ok, let’s get back to the stuff actually eating memory:

    >>> h[0].byrcs
     Index  Count   %     Size   % Cumulative   % Referrers by Kind (class / dict of class)
         0 670791  68 31716096  68   31716096  68 dict (no owner)
         1 139232  14  6525856  14   38241952  83 tuple
         2 136558  14  6042408  13   44284360  96 dict of pkgcore.ebuild.ebuild_src.package
         3  36009   4  1773024   4   46057384  99 dict (no owner), dict of pkgcore.ebuild.ebuild_src.package, weakref.KeyedRef
         4   1762   0   107772   0   46165156 100 list
         5    824   0    69476   0   46234632 100 types.CodeType
         6    140   0    31312   0   46265944 100 function, tuple
         7    194   0    11504   0   46277448 100 dict of module
         8     30   0     6284   0   46283732 100 dict of type
         9     55   0     1972   0   46285704 100 dict of module, tuple

Remember h[0] gave us all str objects, so this is all string objects grouped by the kind(s) of their referrers. Also notice index 3 here is the same set of stuff we saw earlier:

    >>> h[0].byrcs[3] ^ byrcs[5]
    hp.Nothing

Most operators do what you would expect, & intersects for example.

“We have a lot of strings in dicts” is not that useful either, let’s see if we can narrow that down a little:

    >>> h[0].byrcs[0].referrers.byrcs
    Partition of a set of 44124 objects. Total size = 18636768 bytes.
     Index  Count   %     Size   % Cumulative   % Referrers by Kind (class / dict of class)
         0  24681  56 12834120  69   12834120  69 dict of pkgcore.ebuild.ebuild_src.package
         1  19426  44  5371024  29   18205144  98 dict (no owner)
         2      1   0   393352   2   18598496 100 dict of pkgcore.repository.prototype.IterValLazyDict
         3      1   0     6280   0   18604776 100 __builtin__.set
         4      1   0     6280   0   18611056 100 dict of module, guppy.heapy.heapyc.RootStateType
         5      1   0     6280   0   18617336 100 dict of pkgcore.ebuild.eclass_cache.cache
         6      1   0     6280   0   18623616 100 dict of pkgcore.repository.prototype.PackageIterValLazyDict
         7      4   0     5536   0   18629152 100 type
         8      4   0     3616   0   18632768 100 dict of type
         9      1   0     1672   0   18634440 100 dict of module, dict of os._Environ

(Broken down: h[0].byrcs[0] is the set of all str objects referenced only by dicts, h[0].byrcs[0].referrers is the set of those dicts, and the final .byrcs displays those dicts grouped by their referrers.)

Keep an eye on the size column. We have over 12M worth of just dicts (not counting the stuff in them) referenced only as an attribute of ebuild_src.package. If we include the stuff kept alive by those dicts we’re talking about a big chunk of the 100MB total here:

    >>> t = _
    >>> t[0].domisize
    61269552

60M out of our 100M would be deallocated if we killed those dicts. So let’s ask heapy which dicts those are:

    >>> t[0].byvia
    Partition of a set of 24681 objects. Total size = 12834120 bytes.
     Index  Count   %     Size   % Cumulative   % Referred Via:
         0  24681 100 12834120 100   12834120 100 "['data']"

(It is easy to get confused by the “byrcs” view of our “t”. t[0] is not a bunch of “dict of ebuild_src.package”. It is a bunch of dicts with strings in them, namely those that are referred to by the dict of ebuild_src.package, and not by anything else. So the byvia output means those dicts with strings in them are all “data” attributes of ebuild_src.package instances.)

(sidenote: earlier we saw byvia say “.key”, now it says “['data']”.
It’s different because the previous type used __slots__ (so there was no “dict of” involved) and this type does not (so there is a “dict of” and our dicts are the “data” key in it).)

So what is in the dicts:

    >>> t[0].referents
    Partition of a set of 605577 objects. Total size = 34289392 bytes.
     Index  Count   %     Size   % Cumulative   % Kind (class / dict of class)
         0 556215  92 27710068  81   27710068  81 str
         1  24681   4  6085704  18   33795772  99 dict (no owner)
         2  24681   4   493620   1   34289392 100 long
    >>> _.byvia
    Partition of a set of 605577 objects. Total size = 34289392 bytes.
     Index  Count   %     Size   % Cumulative   % Referred Via:
         0  24681   4  6085704  18    6085704  18 "['_eclasses_']"
         1  21954   4  3742976  11    9828680  29 "['DEPEND']"
         2  22511   4  3300052  10   13128732  38 "['RDEPEND']"
         3  24202   4  2631304   8   15760036  46 "['SRC_URI']"
         4  24681   4  1831668   5   17591704  51 "['DESCRIPTION']"
         5  24674   4  1476680   4   19068384  56 "['HOMEPAGE']"
         6  24681   4  1297680   4   20366064  59 "['KEYWORDS']"
         7  24681   4   888516   3   21254580  62 '.keys()[3]'
         8  24681   4   888516   3   22143096  65 '.keys()[9]'
         9  24681   4   810108   2   22953204  67 "['LICENSE']"
    <32 more rows. Type e.g. '_.more' to view.>

Strings, nested dicts and longs, with most of the size eaten up by the “_eclasses_” values. There is also a significant amount eaten up by objects reached as dict keys, which is a bit odd, so let’s investigate:

    >>> refs = t[0].referents
    >>> i = iter(refs.byvia[7].nodes)
    >>> i.next()
    'DESCRIPTION'
    >>> i.next()
    'DESCRIPTION'
    >>> i.next()
    'DESCRIPTION'
    >>> i.next()
    'DESCRIPTION'
    >>> i.next()
    'DESCRIPTION'

Eep!

    >>> refs.byvia[7].bysize
    Partition of a set of 24681 objects. Total size = 888516 bytes.
     Index  Count   %     Size   % Cumulative   % Individual Size
         0  24681 100   888516 100     888516 100 36

It looks like we have 24681 identical strings here, using up about 1M of memory. The other odd entry is the ‘_eclasses_’ string apparently.
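The duplicated 'DESCRIPTION' keys found above are the kind of waste that string interning can remove. The following is a minimal sketch of the idea, not pkgcore code and not necessarily how pkgcore deals with the problem. It is written for a current Python, so it uses sys.intern where the Python 2 sessions above would use the intern builtin. build_metadata, make_key and distinct_key_objects are hypothetical helpers invented for this illustration.

```python
import sys


def build_metadata(raw_items, intern_keys=True):
    """Build a metadata dict, optionally interning the key strings."""
    data = {}
    for key, value in raw_items:
        if intern_keys:
            # All equal keys collapse to one shared str object.
            key = sys.intern(key)
        data[key] = value
    return data


def make_key():
    # Build the key at runtime so each call yields a fresh str object,
    # the way keys parsed out of a cache file would be.
    return "".join(["DESC", "RIPTION"])


def distinct_key_objects(dicts):
    """Count how many different str objects are used as keys."""
    return len({id(key) for d in dicts for key in d})


plain = [build_metadata([(make_key(), "x")], intern_keys=False) for _ in range(3)]
shared = [build_metadata([(make_key(), "x")]) for _ in range(3)]
```

With interning every dict shares a single key object, so the per-package cost of a key drops from one string to one pointer. Heapy should show the difference as a much smaller count under the '.keys()[N]' rows.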
Extra stuff for c extension developers

To provide accurate statistics if your code uses extension types you must provide heapy with a way to get the following data for your custom types:

• How large is a certain instance?
• What objects does an instance contain?
• How does the instance refer to a contained object?

You provide these through a NyHeapDef struct, defined in heapdef.h in the guppy source. This header is not installed, so you should just copy it into your source tree. It is a good idea to read this header file side by side with the following descriptions, since it contains details omitted here. The stdtypes.c file contains implementations for the basic python types which you can read for inspiration.

The NyHeapDef struct provides heapy with three function pointers:

SizeGetter

To answer “how large is an instance” you provide a NyHeapDef_SizeGetter function that is called with a PyObject* and returns an int: the number of bytes the object occupies. If you do not provide this function heapy uses a default that looks at the tp_basicsize and tp_itemsize fields of the type. This means that if you do not allocate any extra memory for non-python objects (e.g. for c strings) you do not need to provide this function.

Traverser

To answer “What objects does an instance contain” you provide a traversal function (NyHeapDef_Traverser). This is called with a pointer to a “visit procedure”, an instance of your extension type and some other stuff. You should then call the visit procedure for every python object contained in your object.

This might sound familiar: to support the python garbage collector you provide a very similar function (tp_traverse). Actually heapy will use tp_traverse if you do not provide a heapy-specific traverse function. Providing a separate one makes sense if you do not support the garbage collector for some reason, or if you contain objects that are irrelevant to the garbage collector.
An example would be a type that contains a single python string object (that no other code can get a reference to). If this object does not have references to other python objects it cannot be involved in cycles so supporting gc would be useless. However you do still want heapy to know about the memory occupied by the contained string. You could do that by adding that size in your NyHeapDef_SizeGetter function but it is probably easier to tell heapy about the string through the traversal function (so you do not have to calculate the memory occupied by the string).

If the above type would also have a reference to some arbitrary (non-private) python object it should support gc, but it does not need to tell gc about the contained string. So you would have two traversal functions, one for heapy that visits the string and one for gc that does not.

RelationGetter

The last function heapy wants tells it in what way your instance refers to some contained object. It is used to provide the “byvia” view. This calls a visit function once for each way your instance refers to a target object, telling it what kind of reference it is.

Providing the heapdef struct to heapy

Once you have the needed function pointers in a struct you need to pass this to heapy somehow. This is done through a standard cpython mechanism called “cobjects”. From python these look like rather stupid objects you cannot do anything with, but from c you can pull out a void* that was put in when the object was constructed. You can wrap an arbitrary pointer in a CObject, make it available as an attribute of your module, then import it from some other module, pull the void* back out and cast it to the original type.

heapy looks for a _NyHeapDefs_ attribute on all loaded modules. If this attribute exists and is a CObject the pointer in it is used as a pointer to an array of NyHeapDef structs (terminated with a struct with only nulls).
Example code doing this is in sets.c in the guppy source.

3.1.16 Plugins system

Goals

The plugin system (pkgcore.plugin) is used to pick up extra code (potentially distributed separately from pkgcore itself) at a place where using the config system is not a good idea for some reason. This means that for a lot of things that most people would call “plugins” you should not actually use pkgcore.plugin, you should use the config system. Things like extra repository types should simply be used as the “class” value in the configuration. The plugin system is currently mainly used in places where handing in a ConfigManager is too inconvenient.

Using plugins

Plugins are looked up based on a string “key”. You can always look up all available plugins matching this key with pkgcore.plugin.get_plugins(key). For some kinds of plugin (the ones defining a “priority” attribute) you can also get the “best” plugin with pkgcore.plugin.get_plugin(key). This does not make sense for all kinds of plugin, so not all of them define this.

The plugin system does not care about what kind of object plugins are, this depends entirely on the key.

Adding plugins

Basics, caching

Plugins for pkgcore are loaded from modules inside the pkgcore.plugins package. This package has some magic to make plugins in any subdirectory pkgcore/plugins under a directory on sys.path work. So if pkgcore itself is installed in site-packages you can still add plugins to /home/you/pythonlib/pkgcore/plugins if /home/you/pythonlib is in PYTHONPATH. You should not put an __init__.py in this extra plugin directory.

Plugin modules should contain a pkgcore_plugins dictionary that maps the “key” strings to a sequence of plugins. This dictionary has to be constant, since pkgcore keeps track of what plugin module provides plugins for what keys in a cache file to avoid unnecessary imports.
So this is invalid:

    try:
        import spork_package
    except ImportError:
        pkgcore_plugins = {}
    else:
        pkgcore_plugins = {'myplug': [spork_package.ThePlugin]}

since if the plugin cache is generated while the package is not available pkgcore will cache the module as not providing any myplug plugins, and the cache will not be updated if the package becomes available (only changes to the mtime of actual plugin modules invalidate the cache). Instead you should do something like this:

    try:
        from spork_package import ThePlugin
    except ImportError:
        class ThePlugin(object):
            disabled = True

    pkgcore_plugins = {'myplug': [ThePlugin]}

If a plugin has a “disabled” attribute the plugin system will never return it from get_plugin or get_plugins.

Priority

If you want your plugin to support get_plugin it should have a priority attribute: an integer indicating how “preferred” this plugin is. The plugin with the highest priority (that is not disabled) is returned from get_plugin.

Some types of plugins need more information to determine a priority value. Those should not have a priority attribute. They should use get_plugins instead and have a method that gets passed the extra data and returns the priority.

Import behaviour

Assuming the cache is working correctly (it was generated after installing a plugin as root) pkgcore will import all plugin modules containing plugins for a requested key in priority order until it hits one that is not disabled. The “disabled” value is not cached (a plugin that is unconditionally disabled makes no sense), but the priority value is. You can fake a dynamic priority by having two instances of your plugin registered and only one of them enabled at the same time.

This means it makes sense to have only one kind of plugin per plugin module (unless the required imports overlap): this avoids pulling in imports for other kinds of plugin when one kind of plugin is requested.
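The selection rule described above (the highest priority wins, disabled plugins are skipped) can be sketched as follows. This is only an illustration under stated assumptions: pick_best, FastPlugin and FallbackPlugin are hypothetical names, and the sketch ignores the cache file and the lazy imports that the real pkgcore.plugin code relies on.

```python
class FallbackPlugin(object):
    priority = 10


class FastPlugin(object):
    priority = 20
    # Hypothetical: set at import time, e.g. because a dependency is missing.
    disabled = True


def pick_best(plugins):
    """Return the enabled plugin with the highest priority, or None."""
    enabled = [p for p in plugins if not getattr(p, "disabled", False)]
    if not enabled:
        return None
    return max(enabled, key=lambda p: p.priority)


# A plugin module maps key strings to sequences of plugins.
pkgcore_plugins = {"myplug": [FallbackPlugin, FastPlugin]}

best = pick_best(pkgcore_plugins["myplug"])
```

Here FastPlugin has the higher priority but is disabled, so FallbackPlugin is chosen. Registering two variants of a plugin and enabling only one of them is the “fake dynamic priority” trick mentioned above.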
The disabled value is not cached by the plugin system after the plugin module is imported. This means it should be a simple attribute (either completely constant or set at import time) or a property that does its own caching.

Adding a plugin package

Both get_plugin and get_plugins take a plugin package as second argument. This means you can use the plugin system for external pkgcore-related tools without cluttering up the main pkgcore plugin directory. If you do this you will probably want to copy the __path__ trick from pkgcore/plugin/__init__.py to support plugins elsewhere on sys.path.

3.1.17 Pkgcore/Portage differences

Disclaimer

Pkgcore moves fairly fast in terms of development. We will strive to keep this doc up to date, but it may lag behind the actual code.

Ebuild environment changes

All changes are either glep33 related, or a tightening of the restrictions on the env to block common snafus that localize the ebuild’s environment to that machine.

• portageq based functions are disabled in the global scope. The reasoning is QA: has_version/best_version must not affect the generated metadata.
• inherit is disabled in all phases but depend and setup. Folks no longer do it, but inherit from within one of the build/install phases is now actively blocked.
• The ebuild env is now effectively akin to suspending the process and restarting it. Essentially, when transitioning between ebuild phases, the ebuild environment is snapshotted, cleaned of irrelevant data (bash forced vars for example, or vars that pkgcore sets for the local system on each shift into a phase), and saved. Portage does this partially (it re-execs ebuilds/eclasses, thus stomping the env on each phase change); pkgcore does it fully. As such, pkgcore is capable of glep33, while portage is not (env fixes are the basis of glep33).
• ebuild.sh now protects itself from basic fiddling.
Ebuild generated state must work as long as the EAPI is the same, regardless of the portage version that generated it and the portage version that later uses the saved state (simple example: for state generated with portage-2.51, if portage 3 is EAPI compliant with that env, it must not allow its internal bash changes to break the env). As such, certain funcs are not modifiable by the ebuild, namely internal portage/pkgcore functionality (hasq/useq for example). Those functions that are read-only also are not saved in the ebuild env (they should be supplied by the portage/pkgcore instance reloading the env).
• ebuild.sh is daemonized. The upshot of this is that regen is roughly 2x faster (careful reuse of ebuild.sh instances rather than forcing bash to spawn all over). An additional upshot is that there are bidirectional communication pipes between ebuild.sh and the python parent: env inspection, logging, and passing requests up to the python side (has_version/best_version for example) are now handled within the existing processes. The design of it from the python side is that of an extensible event handler; as such it’s extremely easy to add new commands in, or to special case certain things.

Repository Enhancements

Pkgcore internally uses a sane/uniform repository abstraction. The benefits of this are:

• The repository class (which implements the accessing of the on disk/remote tree) is pluggable. Remote vdb/portdir is doable, as is having your repository tree run strictly from downloaded metadata (for example), or running from a tree stored in a tarball/zip file (mildly crazy, but it’s doable).
• Separated repository instances. We’ve not thrown out overlays (as paludis did), but pkgcore doesn’t force every new repository to be an overlay of the ‘master’ PORTDIR as portage does.
• Optimized repository classes: for the usual vdb and ebuild repository (those being on disk backwards compatible with portage 2.x), the number of syscalls required was drastically reduced, with ondisk info (what packages are available per category, for example) cached. It is a space vs time trade off, but the space cost is negligible (a couple of dicts with, worst case, 66k mappings). As is, portage’s listdir caching consumed a bit more memory and was slower, so all in all a gain (mainly it’s faster while using slightly less memory than portage’s caching).
• Unique package instances yielded from the repository. Pkgcore uses a package abstraction internally for accessing metadata/version/category, etc. All instances returned from repositories are unique immutable instances. The gain is that if you’ve got dev-util/diffball-0.7.1 sitting in memory already, the repository will return that instance instead of generating a new one. Since metadata is accessed via the instance, you get at most one load from the cache backend per instance in memory, and a cache pull only occurs when required. As such, it is far faster when doing random package accessing and storing of said packages (think repoman, dependency resolution, etc).

3.1.18 Tackling domain

tag a ‘x’ in front of stuff that’s been implemented

unhandled (eg, figure these out) vars/features

• (user)?sandbox
• userpriv(_fakeroot)?
• digest
• cvs (this option is a hack)
• fixpackages, which probably should be a sync thing (would need to bind the vdb and binpkg repo to it though)
• keep(temp|work), easy to implement, but where to define it?
• PORT_LOGDIR
• env overrides of use...

vdb wrapper/vdb repo instantiation (either domain created wrapper, or required in the vdb repo section def)

• CONFIG_PROTECT*
• collision-protect
• no(doc|man|info|clean) (wrapper/mangler)
• suidctl
• nostrip. in effect, strip defaults to on; wrappers if after occasionally on, occasionally off.
• sfperms

build section (vars)

• C(HOST|TARGET), (LD*|C*)FLAGS?
• (RESUME|FETCH)COMMAND are fetcher things, define it there.
• MAKEOPTS
• PORTAGE_NICENESS (imo)
• TMPDIR ? or domain it?

gpg is bound to repo, class type specifically. strict/severe are likely settings of it. The same applies for profiles.

distlocks is a fetcher thing, specifically (probably) class type.

buildpkgs is binpkg + filters.

package.provided is used to generate a separate vdb, a null vdb that returns those packages as installed.

3.1.19 Testing

We use twisted.trial for our tests. To run the test framework, run:

    trial pkgcore

Your own tests must be stored in pkgcore.test. Furthermore, tests must pass when run repeatedly (-u option). You will want at least twisted-2.2 for that, <2.2 has a few false positives.

Testing for negative assertions

When coding it’s easy to write test cases asserting that you get result xyz from foo, usually asserting the correct flow. This is ok if nothing goes wrong, but that doesn’t normally happen. :)

Negative assertions (there probably is a better term for it) means asserting failure conditions and ensuring that the code handles zyx properly when it gets thrown at it. Most test cases seem to miss this, resulting in bugs being able to hide away for when things go wrong.

Using --coverage

When writing tests for your code (or for existing code without any tests), it is very useful to use --coverage. Run trial --coverage <path/to/test>, and then check <cwd>/_trial_temp/coverage/<test/module/name>. Any lines prefixed with ‘>>>>>’ have not been covered by your tests. This should be rectified before your code is merged to mainline (though this is not always possible). Those lines prefixed with a number show the number of times that line of code is evaluated.
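As an illustration of the negative-assertion idea from this section, the sketch below tests both the expected flow and several failure conditions. It uses the standard library unittest module so that it is self-contained; the assumption is that the same test class would run under twisted.trial, which is not verified here. parse_version is a hypothetical function invented for the example, not part of pkgcore.

```python
import unittest


def parse_version(text):
    """Split a dotted version string such as '0.7.1' into a tuple of ints."""
    if not text:
        raise ValueError("empty version string")
    try:
        return tuple(int(part) for part in text.split("."))
    except ValueError:
        raise ValueError("invalid version string: %r" % (text,))


class ParseVersionTest(unittest.TestCase):

    def test_valid(self):
        # The "correct flow" assertion most tests already have.
        self.assertEqual(parse_version("0.7.1"), (0, 7, 1))

    def test_invalid(self):
        # Negative assertions: bad input must fail loudly, not slip through.
        for bad in ("", "1..2", "1.x", "a.b.c"):
            self.assertRaises(ValueError, parse_version, bad)


suite = unittest.defaultTestLoader.loadTestsFromTestCase(ParseVersionTest)
result = unittest.TextTestRunner(verbosity=0).run(suite)
```

Running such a test under --coverage should mark both raise statements in parse_version as executed; with only test_valid they would stay uncovered.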
3.1.20 perl CPAN

• makeCPANstub in Gentoo/CPAN.pm, dumps cpan config
• screen scraping to get deps, example page http://kobesearch.cpan.org/, use getCPANInfo from CPAN
• use FindDeps for this
• use unmemoize(func) to back out the memoizing of a func; do this on FindDeps

3.1.21 dpkg

This is just basic notes, nothing more. If you know details, kindly fill in the gaps.

repos are combined.

Sources.gz (list of source based deb’s) holds name, version, and build deps.

Packages.gz (binary debs, dpkgs) holds name, version, size, short and long description, bindeps.

repository layout:

    dists
      stable
        main
          arch      #binary-arm fex
          source    #?
        contrib     #?
          arch      # binary-arm fex
          source
        non-free    # guess.
          arch
          source
      testing...
      unstable...

arch/binary-* dirs hold Packages.gz, and Release (potentially).

source dirs hold Sources.gz and Release (optionally).

has preinst, postinst, prerm, postrm. Same semantics as ebuilds in terms of when to run (coincidence? :)

    in dpkg          in ebuild
    Build-Depends    our DEPEND
    Depends          our RDEPEND
    Pre-Depends      configure time DEPEND
    Conflicts        blockers, affected by Essential (read up on this in debian policy guide)

3.1.22 WARNING

This is the original brain dump from harring; it is not guaranteed to be accurate to the current design. It’s kept around to give an idea of where things came from, to contrast with what is in place now.

3.1.23 Introduction

e’yo. General description of layout/goals/info/etc, and semi sortta api. That and aggregator of random ass crazy quotes should people get bored.

DISCLAIMER

This ain’t the code.

In other words, the actual design/code may be radically different, and this document probably will trail any major overhauls of the design/code (speaking from past experience). Updates are welcome, as are suggestions and questions. Please dig through all the documentation in the dir this doc is in, however, since there is a lot of info (both current and historical) related to it.
Collapsing info into this doc is attempted, but explanation of the full restriction protocol (fex) is a lot of info, and the original idea is from previous redesign err... designs. Short version: historical, but still relevant, info for restriction is in layout.txt. Other subsystems/design choices quite likely have their basis in other docs in this directory, so do your homework please :)

Terminology

cp
    category/package
cpv
    category/package-version
ROOT
    livefs merge point, fex /home/bharring/embedded/arm-target or more commonly, root=/
vdb
    /var/db/pkg, installed packages database.
domain
    combination of repositories, root, and build information (use flags, cflags, etc). config data + repositories effectively.
repository
    trees. ebuild tree (/usr/portage), binpkg tree, vdb tree, etc.
protocol
    python name for design/api. iter() fex, is a protocol; for iter(o) it does i=o.__iter__(); the returned object is expected to yield an element when i.next() is called, till it runs out of elements (then throwing a StopIteration). hesitate to call it a defined hook on a class/instance, but this (crappy) description should suffice.
seq
    sequence, lists/tuples
set
    list without order (think dict.keys())

General design/idea/approach/requirements

All pythonic components installed by pkgcore must be within the pkgcore.* namespace. No more polluting python's namespace, plain and simple. Third party plugins to pkgcore aren't bound by this however (their mess, not ours).

API flows from the config definitions; everything internal is effectively the same. Basically, config data gives you your starter objects, and from there you dig deeper into the innards as needed, action wise.

The general design is intended to heavily abuse OOP. Further, delegation of actions down to components must be abided by, example being repo + cache interaction. repo does what it can, but for searching the cache, let the cache do it.
Assume what you're delegating to knows the best way to handle the request, and probably can do its job better than some external caller (essentially).

Actual configuration is pretty heavily redesigned. Classes and functions that should be constructed based on data from the user's configuration have a "hint" describing their arguments. The global config class uses these hints to convert and typecheck the values in the user's configuration. Actual configuration file reading and type conversion is done by a separate class, meaning the global manager is not tied to a single format, or even to configuration read from a file on disk.

Encapsulation, extensibility/modularity, delegation, and allowing parallelizing of development should be key focuses in implementing/refining this high level design doc. Realize parallelizing is a funky statement, but it's apt; work on the repo implementations can proceed without being held up by cache work, and vice versa.

Final comment re: design goals, defining chunks of callable code and plugging it into the framework is another bit of a goal. Think twisted, just not quite as prevalent (their needs/focus is much different from ours, twisted is the app, your code is the lib, vice versa for pkgcore).

Back to config. Here's the general notion of config 'chunks' of the subsystem (these map out to run time objects unless otherwise stated):

    domain
    +-- profile (optional)
    +-- fetcher (default)
    +-- repositories
    +-- resolver (default)
    +-- build env data?
    |   (never actually instantiated, no object)
    \-- livefs_repo (merge target, non optional)

    repository
    +-- cache (optional)
    +-- fetcher (optional)
    +-- sync (optional, may change)
    \-- sync cache (optional, may change)

    profile
    +-- build env?
    +-- sets (system mainly).
    \-- visibility wrappers

domain is configuration data, accept_(license|keywords), use, cflags, chost, features, etc.
profile, dependent on the profile class you choose, is either bound to a repository, or to a user defined location on disk (/etc/portage/profile fex). Domain knows to do incremental crap upon profile settings, lifting package.* crap for visibility wrappers for repositories also.

repositories is pretty straightforward. portdir, binpkg, vdb, etc.

Back to domain. Domains are your definition of pretty much what can be done. Can't do jack without a domain, period. Can have multiple domains also, and domains do not have to be local (remote domains being a different class type). Clarifying, think of 500 desktop boxes, and a master box that's responsible for managing them. Define an appropriate domain class, and appropriate repository classes, and have a config that holds the 500 domains (representing each box), and you can push updates out via standard api trickery. In other words, the magic is hidden away; just define remote classes that match defined class rules (preferably inheriting from the base class, since isinstance sanity checks will become the norm), and you could do emerge --domain some-remote-domain -u glsa on the master box. Emerge won't know it's doing remote crap. Pkgcore won't even. It'll just load what you define in the config.

Ambitious? Yeah, a bit. Thing to note, the remote class additions will exist outside of pkgcore proper most likely. Develop the code needed in parallel to fleshing pkgcore proper out.

Meanwhile, the remote bit + multiple domains + class overrides in config definition is _explicitly_ for the reasons above. That and x-compile/embedded target building, which is a bit funkier.

Currently, portage has DEPEND and RDEPEND. How do you know what needs to be native to that box to build the package, and what must be chost atoms? Literally, how do you know which atoms, say the toolchain, must be native vs what package's headers/libs must exist to build it?
We need an additional metadata key, BDEPEND (build depends). If you have BDEPEND, you know what actually is run locally in building a package, vs what headers/libs are required. Subtle difference, but BDEPEND would allow (with a sophisticated depresolver) the toolchain to be represented in deps, rather than the current unstated dep approach profiles allow.

Aside from that, BDEPEND could be used for x-compile via inter-domain deps; a ppc target domain on an x86 box would require BDEPEND from the default domain (x86). So... that's useful.

So far, no one has shot this down or, more so, come up with reasons as to why it wouldn't work; the consensus thus far is mainly "err, don't want to add the deps, too much work". Regarding work, use indirection.

virtual/toolchain-c
    metapkg (glep37) that expands out (dependent on arch) into whatever is required to do building of c sources
virtual/toolchain-c++
    same thing, just c++
virtual/autotools
    take a guess.
virtual/libc
    this should be tagged into rdepends where applicable, packages that directly require it (compiled crap mainly)

Yes it's extra work, but the metapkgs above should cover a large chunk of the tree, say >90%.

Config design

Portage thus far (<=2.0.51*) has had variable ROOT (livefs merge point), but no way to vary configuration data aside from via a buttload of env vars. Further, there has been only one repository allowed (overlays are just that, extensions of the 'master' repository). Addition of support for any new format is mildly insane due to hardcoding up the wing wang in the code, and extension/modification of existing formats (ebuild) has some issues (namely the doebuild block of code).

Goal is to address all of this crap. Format agnosticism at the repository level is via an abstracted repository design that should supply generic inspection attributes to match other formats. Specialized searching is possible via match, thus extending the extensibility of the prototype repository design.
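The idea of a format-agnostic repository that is searched via match can be sketched as follows. This is only an illustration under stated assumptions: PrototypeRepo and CategoryRestriction are hypothetical names made up for the example, packages are modelled as plain (category, package, version) tuples, and none of this is pkgcore's actual implementation.

```python
class PrototypeRepo:
    """Hypothetical minimal repository; packages are (category, package, version) tuples."""

    def __init__(self, packages):
        # Sort once so iteration order is stable for the example.
        self._pkgs = sorted(packages)

    def __iter__(self):
        # The repository itself is iterable, yielding package instances.
        return iter(self._pkgs)

    def itermatch(self, restriction):
        # Any object exposing match(pkg) -> bool works as a restriction,
        # so the repo needs no knowledge of the package format.
        return (pkg for pkg in self if restriction.match(pkg))

    def match(self, restriction):
        return list(self.itermatch(restriction))


class CategoryRestriction:
    """Hypothetical restriction matching on the category field only."""

    def __init__(self, category):
        self.category = category

    def match(self, pkg):
        return pkg[0] == self.category


repo = PrototypeRepo([
    ("dev-util", "git", "1.0"),
    ("sys-apps", "sed", "4.2"),
    ("dev-util", "diffball", "0.7"),
])
matches = repo.match(CategoryRestriction("dev-util"))
```

Because the caller only ever hands a restriction object to match, a repository backed by some other format could swap in its own itermatch without the caller changing.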
Format agnosticism for building/merging is somewhat reliant on the repo, namely package abstraction, and abstraction of building/merging operations.

On disk configuration for alternative formats is extensible via changing section types, and plugging them into the domain definition.

Note alt. formats quite likely will never be implemented in pkgcore proper; that's kind of the domain of pkgcore addons. In other words, dpkg/rpm/whatever quite likely won't be worked on by pkgcore developers, at least not in the near future (too many other things to do).

The intention is to generalize the framework so it's possible for others to do so if they choose, however.

Why is this good? Ebuild format has issues, as does our profile implementation. At some point, alternative formats/non-backwards compatible tweaks to the formats (ebuild or profile) will occur, and then people will be quite happy that the framework is generalized (seriously, nothing is lost from a properly abstracted design, and flexibility/power is gained).

config's actions/operation

pkgcore.config.load_config() is the entrance point; it returns to you a config object (pkgcore.config.central). This object gives you access to the user defined configs, although the only interest in poking at it should be to get a domain object from it.

The domain object is instantiated by the config object via user defined configuration. Domains hold instantiated repositories, bind profile + user prefs (use/accept_keywords) together, and _should_ simplify this data into somewhat user friendly methods. (define this better).

Normal/default domain doesn't know about other domains, nor give a damn. Embedded targets are domains, and _will_ need to know about the livefs domain (root=/), so buildplan creation/handling may need to be bound into domains.

Objects/subsystems/stuff

So... this is general naming of pretty much top level view of things, stuff emerge would be interested in (and would fool with).
hesitate to call it a general api, but it probably will be as such, exempting any abstraction layer/api over all of this (good luck on that one }:] ).

IndexableSequence

functions as a set and dict, with caching and on the fly querying of info. mentioned due to use in repository and other places... (it's a useful lil sucker)

This actually is misnamed. the order of iteration isn't necessarily reproducible, although it's usually constant. IOW, it's normally a sequence, but the class doesn't implicitly force it.

LazyValDict

similar to ixseq, late loading of keys, on the fly pulling of values as requested.

global config object (from pkgcore.config.load_config())

see config.rst.

domain object

bit of debate on this one I expect.

any package.{mask,unmask,keywords} mangling is instantiated as a wrapper around repository instances upon domain instantiation. code should be smart and lift any package.{mask,unmask,keywords} wrappers from repository instances and collapse them, pointing at the raw repo (basically don't have N wrappers, collapse it into a single wrapper). Not worth implementing until the wrapper is a faster implementation than the current pkgcore.repository.visibility hack though (currently O(N) for each pkg instance, N being visibility restrictions/atoms). Once it's O(1), collapsing makes a bit more sense (can be done in parallel however).

a word on inter repository dependencies... simply put, if the repository only allows satisfying deps from the same repository, the package instance's *DEPEND atom conversions should include that restriction. Same trickery for keeping ebuilds from depping on rpm/dpkg (and vice versa).

.repositories
    in the air somewhat on this one. either indexablesequence, or a repositorySet. Nice aspect of the latter is you can just use .match with appropriate restrictions. very simple interface imo, although it should provide a way to pull individual repositories/labels of said repos from the set though.
    basically, mangle a .raw_repo indexablesequence type trick (hackish, but nail it down when we reach that bridge)

build plan creation

<TODO insert details as they're fleshed out>

sets

TODO chuck in some details here. probably defined via user config and/or profile, although what's it define? atoms/restrictions? itermatch might be useful for a true set.

build/setup operation

(need a good name for this; dpkg/rpm/binpkg/ebuild's 'prepping' for livefs merge should all fall under this, with varying use of the hooks)

.build()
    do everything, calling all steps as needed
.setup()
    whatever tmp dirs required, create 'em.
.req_files()
    (fetchables, although not necessarily with url (restrict="fetch"...))
.unpack()
    guess.
.configure()
    unused till ebuild format version two (ya know, that overhaul we've been kicking around? :)
.compile()
    guess.
.test()
    guess.
.install()
    install to tmp location. may not be used dependent on the format.
.finalize()
    good to go. generate (jit?) contents/metadata attributes, or returns a finalized instance. should generate an immutable package instance.

repo change operation

base class.

.package
    package instance of what the action is centering around.
.start()
    notify repo we're starting (locking mainly, although prerm/preinst hook also)
.finish()
    notify repo we're done.
.run()
    high level, calls whatever funcs needed. individual methods are mainly for ui, this is if you don't display "doing install now... done... doing remove now... done" stuff.

remove operation

derivative of repo change operation.

.remove()
    guess.
.package
    package instance of what's being yanked.

install operation

derivative of repo change operation

.package
    what's being installed.
.install()
    install it baby.

merge operation

derivative of repo remove and install (so it has .remove and .install, which must be called in .install and .remove order)

.replacing
    package instance of what's being replaced.
.package
    what's being installed

fetchables

basically a dict of stuff jammed together, just via attribute access (think c struct equiv)

.filename
.url
    tuple/list of urls.
.chksums
    dict of chksum:val

fetcher

hey hey. take a guess.

worth noting, if fetchable lacks .chksums["size"], it'll wipe any existing file. if size exists, and existing file is bigger, wipe file, and start anew, otherwise resume. mirror expansion occurs here, also.

.fetch(fetchable, verifier=None)
    # if verifier handed in, does verification.

verifier

note this is basically lifted conceptually from mirror_dist. if wondering about the need/use of it, look at that source.

verify()
    handed a fetchable, returns either False or True

repository

this should be format agnostic, and hide any remote bits of it. this is general info for using it, not designing a repository class

.mergable()
    true/false. pass a pkg to it, and it reports whether it can merge that or not.
.livefs
    boolean, indicative of whether or not it's a livefs target. this is useful for resolver, shop it to other repos, binpkg fex prior to shopping it to the vdb for merging to the fs. Or merge to livefs, then binpkg it while continuing further building dependent on that package (ui app's choice really).
.raw_repo
    either it weakref's self, or non-weakref refs another repo. why is this useful? visibility wrappers... this gives ya a way to see if p.mask is blocking usable packages fex. useful for the UI, not too much for pkgcore innards.
.frozen
    boolean. basically, does it account for things changing without its knowledge, or does it not. frozen=True is faster for ebuild trees for example, single check for cache staleness.
    frozen=False is slower, and is what portage does now (meaning every lookup of a package, and instantiation of a package instance requires mtime checks for staleness).
.categories
    IndexableSequence, if iterated over, gives ya all categories, if getitem lookup, sub-category category lookups. think media/video/mplayer
.packages
    IndexableSequence, if iterated over, all package names. if getitem (with category as key), packages of that category.
.versions
    IndexableSequence, if iterated over, all cpvs. if getitem (with cat/pkg as key), versions for that cp
.itermatch()
    iterable, given an atom/restriction, yields matching package instances.
.match()
    def match(self, atom): return list(self.itermatch(atom))

    voila.
.__iter__()
    in other words, repository is iterable. yields package instances.
.sync()
    sync, if the repo swings that way. flesh it out a bit, possibly handing in/back ui object for getting updates...

digressing for a moment...

note you can group repositories together, think portdir + portdir_overlay1 + portdir_overlay2. Creation of a repositoryset basically would involve passing multiple instantiated repos, and depending on that class's semantics, it internally handles the stacking (right most positional arg repo overrides 2nd right most, ... overriding left most). So... stating it again/clearly if it ain't obvious, everything is configuration/instantiating of objects, chucked around/mangled by the pkgcore framework.

What isn't obvious is that since a repository set gets handed instantiated repositories, each repo, including the set instance, should be able to have its own cache (this is assuming it's ebuild repos through and through). Why? Cache data doesn't change for the most part exempting which repo a cpv is from, and the eclass stacking.
Handled individually, a cache bound to portdir should be valid for portdir alone; it shouldn't carry data that is a result of eclass stacking from another overlay + that portdir. That's the business of the repositoryset. Consequence of this is that the repositoryset needs to basically reach down into the repository it's wrapping, get the pkg data, then rerequest the keys from that ebuild with a different eclass stack. This would be a bit expensive, although once inherit is converted to a pythonic implementation (basically handing the path to the requested eclass down the pipes to ebuild*.sh), it should be possible to trigger a fork in the inherit, and note python side that multiple sets of metadata are going to be coming down the pipe. That should alleviate the cost a bit, but it also makes multiple levels of cache reflecting each repository instance a bit nastier to pull off till it's implemented.

So... short version. Harring is a perfectionist, and says it should be this way. reality of the situation makes it a bit trickier. Anyone interested in attempting the mod, feel free, otherwise harring will take a crack at it since he's being anal about having it work in such a fashion.

Or... could do thus. repo + cache as a layer, wrapped with a 'regen' layer that handles cache regeneration as required. Via that, would give the repositoryset a way to override and use its own specialized class that ensures each repo gets what's proper for its layer. Think raw_repo type trick.

continuing on...

cache

ebuild centric, although who knows (binpkg cache ain't insane ya know). short version, it's functionally a dict, with sequence properties (iterating over all keys).

.keys()
    return every cpv/package in the db.
.readonly
    boolean. Is it modifiable?
.match()
    Flesh this out. Either handed a metadata restriction (or set of 'em), or handed dict with equiv info (like the former).
    ebuild caches most likely should return mtime information alongside, although maybe dependent on readonly. purpose of this? Gives you a way to hand off metadata searching to the cache db, rather than the repo having to resort to pulling each cpv from the cache and doing the check itself. This is what will make rdbms cache backends finally stop sucking and seriously rock, properly implemented at least. :) clarification, you don't call this directly, repo.match delegates off to this for metadata only restrictions

package

this is a wrapped, constant package. configured ebuild src, binpkg, vdb pkg, etc. ebuild repositories don't exactly return this; they return unconfigured pkgs, which I'm not going to go into right now (domains only see this protocol, visibility wrappers see different)

.depends
    usual meaning. ctarget depends
.rdepends
    usual meaning. ctarget run time depends. seq.
.bdepends
    see ml discussion. chost depends, what's executed in building this (toolchain fex). seq.
.files
    get a better name for this. doesn't encompass files/*, but could be slipped in that way for remote. encompasses restrict fetch (files with urls), and chksum data. seq.
.description
    usual meaning, although remember we probably need a way to merge metadata.xml long desc into the more mundane description key.
.license
    usual meaning, depset
.homepage
    usual. Needed?
.setup()
    Name sucks. gets ya the setup operation, which does building/whatever.
.data
    Raw data. may not exist, don't screw with it unless you know what it is, and know the instance's .data layout.
.build()
    if this package is buildable, return a build operation, else return None

restriction

see layout.txt for more fleshed out examples of the idea. note, match and pmatch have been reversed namewise.

.match()
    handed package instance, will return bool of whether or not this restriction matches.
.cmatch()
    try to force the changes; this is dependent on the package being configurable.
.itermatch()
    new one, debatable. short version, given a sequence of package instances, yields true/false for them. why might this be desirable? if setup of matching is expensive, this gives you a way to amortize the cost. might have use for glsa set target. define a restriction that limits to installed pkgs, yay/nay if update is avail...

restrictionSet

mentioning it merely cause it's a grouping (boolean and/or) of individual restrictions. an atom, which is in reality a category restriction, package restriction, and/or version restriction, is a boolean and set of restrictions.

ContentsRestriction

what's this you say? a restriction for searching the vdb's contents db? Perish the thought! ;)

metadataRestriction

Mentioning this for the sake of pointing out a subclass of it, DescriptionRestriction; this will be a class representing matching against description data. See repo.match and cache.match above. The short version is that it encapsulates the description search (a very slow search right now) so that repo.match can hand off to the cache (delegation), and the cache can do the search itself, however it sees fit.

So... for the default cache, flat_list (19500 ebuilds == 19500 files to read for a full searchDesc), it still is slow unless flat_list gets some desc. cache added to it internally. If it's a sql based cache, the sql_template should translate the query into the appropriate select statement, which should make it much faster.

Restating that, delegation is absolutely required. There have been requests to add intermediate caches to the tree, or move data (whether collapsing metadata.xml or moving data out of ebuilds) so that the form it is stored in is quicker to search. These approaches are wrong. Should be clear from above that a repository can, and likely will be remote on some boxes.
Such a shift of metadata does nothing but make repository implementations that much harder, and shift power away from what knows best how to use it. Delegation is a massively more powerful approach, allowing for more extensibility, flexibility and speed.

Final restating: searchDesc is matching against cache data. The cache (whether flat_list, anydbm, sqlite, or a remote sql based cache) is the authority about the fastest way to do searches of its data. Programmers get pissed off when users try and tell them how something internally should be implemented; it's fundamentally the same scenario. The cache class the user chooses knows how to do its job the best; provide methods of handing control down to it, and let it do its job (delegation). Otherwise you've got a backseat driver situation, which doesn't let those in the know do the deciding (cache knows, repo doesn't).

Mind you, not trying to be harsh here. If in reading through the full doc you disagree, question it; if after speeding up current cache implementation, note that any such change must be backwards compatible, and not screw up the possibilities of encapsulation/delegation this design aims for.

logging

flesh this out (define this basically). short version, no more writemsg type trickery, use a proper logging framework.

ebuild-daemon.sh

Hardcoded paths have to go. /usr/lib/portage/bin == kill it. Upon initial loadup of ebuild.sh, dump the default/base path down to the daemon, including a setting for /usr/lib/portage/bin . Likely declare -xr it, then load the actual ebuild*.sh libs. Backwards compatibility for that is thus, ebuild.sh defines the var itself in global scope if it's undefined. Semblance of backwards compatibility (which is actually somewhat pointless since I'm about to blow it out of the water).

Ebuild-daemon.sh needs a function for dumping a _large_ amount of data into bash, more than just a line or two.
For the ultra paranoid, we load up eclasses, ebuilds, profile.bashrc's into python side, pipe that to gpg for verification, then pipe that data straight into bash. No race condition possible for files used/transferred in this manner.

A thought. The screw around speed up hack preload_eclasses added in ebd's heyday of making it as fast as possible would be one route; basically, after verification of an elib/eclass, preload the eclass into a func in the bash env. and declare -r the func after the fork. This protects the func from being screwed with, and gives a way to (at least per ebd instance) cache the verified bash code in memory.

It could work surprisingly enough (the preload_eclass command already works), and probably be fairly fast versus the alternative. So... the race condition probably can be flat out killed off without massive issues. Still leaves a race for perms on any files/*, but neh. A) That stuff shouldn't be executed, B) security is good, but we can't cover every possibility (we can try, but diminishing returns)

A lesser, but still tough version of this is to use the indirection for actual sourcing to get paths instead. No EBUILD_PATH, query python side for the path, which returns either '' (which ebd interprets as "err, something is whacked, time to scream"), or the actual path.

In terms of timing, gpg verification of ebuilds probably should occur prior to even spawning ebd.sh. profile, eclass, and elib sourcing should use this technique to do on the fly verification though. Object interaction for that one is going to be really fun, as will be mapping config settings to instantiation of objs.

CHAPTER 4

Indices and tables

• genindex
• modindex
• search