pkgcore Documentation Release trunk Brian Harring, Marien Zwart, Tim Harder October 25, 2014

Contents

1 API Documentation
  1.1 Modules
2 Man Pages
  2.1 Installed Commands
3 Developer Notes
  3.1 Content
4 Indices and tables
CHAPTER 1
API Documentation
1.1 Modules
pkgcore
pkgcore.binpkg
pkgcore.binpkg.remote
pkgcore.binpkg.repo_ops
pkgcore.binpkg.repository
pkgcore.binpkg.xpak
pkgcore.cache
pkgcore.cache.errors
pkgcore.cache.flat_hash
pkgcore.cache.fs_template
pkgcore.cache.metadata
pkgcore.config
pkgcore.config.basics
pkgcore.config.central
pkgcore.config.cparser
pkgcore.config.dhcpformat
pkgcore.config.domain
pkgcore.config.errors
pkgcore.config.mke2fsformat
pkgcore.const
pkgcore.ebuild
pkgcore.ebuild.atom
pkgcore.ebuild.restricts
pkgcore.ebuild.conditionals
pkgcore.ebuild.const
pkgcore.ebuild.cpv
pkgcore.ebuild.digest
pkgcore.ebuild.domain
pkgcore.ebuild.ebd
pkgcore.ebuild.ebuild_built
pkgcore.ebuild.ebuild_src
pkgcore.ebuild.eclass_cache
pkgcore.ebuild.errors
pkgcore.ebuild.filter_env
pkgcore.ebuild.formatter
pkgcore.ebuild.misc
pkgcore.ebuild.portage_conf
pkgcore.ebuild.processor
pkgcore.ebuild.profiles
pkgcore.ebuild.repo_objs
pkgcore.ebuild.repository
pkgcore.ebuild.resolver
pkgcore.ebuild.triggers
pkgcore.fetch
pkgcore.fetch.base
pkgcore.fetch.custom
pkgcore.fetch.errors
pkgcore.fs
pkgcore.fs.contents
pkgcore.fs.fs
pkgcore.fs.livefs
pkgcore.fs.ops
pkgcore.fs.tar
pkgcore.gpg
pkgcore.log
pkgcore.merge
pkgcore.merge.const
pkgcore.merge.engine
pkgcore.merge.errors
pkgcore.merge.triggers
pkgcore.operations
pkgcore.operations.domain
pkgcore.operations.format
pkgcore.operations.observer
pkgcore.operations.repo
pkgcore.os_data
pkgcore.package
pkgcore.package.base
pkgcore.package.conditionals
pkgcore.package.errors
pkgcore.package.metadata
pkgcore.package.mutated
pkgcore.package.virtual
pkgcore.pkgsets
pkgcore.pkgsets.filelist
pkgcore.pkgsets.glsa
pkgcore.pkgsets.installed
pkgcore.pkgsets.system
pkgcore.plugin
pkgcore.repository
pkgcore.repository.configured
pkgcore.repository.errors
pkgcore.repository.misc
pkgcore.repository.multiplex
pkgcore.repository.prototype
pkgcore.repository.syncable
pkgcore.repository.util
pkgcore.repository.virtual
pkgcore.repository.visibility
pkgcore.repository.wrapper
pkgcore.resolver
pkgcore.resolver.choice_point
pkgcore.resolver.pigeonholes
pkgcore.resolver.plan
pkgcore.resolver.state
pkgcore.resolver.util
pkgcore.restrictions
pkgcore.restrictions.boolean
pkgcore.restrictions.delegated
pkgcore.restrictions.packages
pkgcore.restrictions.restriction
pkgcore.restrictions.util
pkgcore.restrictions.values
pkgcore.scripts
pkgcore.scripts.filter_env
pkgcore.scripts.pclone_cache
pkgcore.scripts.pconfig
pkgcore.scripts.pebuild
pkgcore.scripts.pinspect
pkgcore.scripts.pmaint
pkgcore.scripts.pmerge
pkgcore.scripts.pplugincache
pkgcore.scripts.pquery
pkgcore.spawn
pkgcore.sync
pkgcore.sync.base
pkgcore.sync.bzr
pkgcore.sync.cvs
pkgcore.sync.darcs
pkgcore.sync.git
pkgcore.sync.hg
pkgcore.sync.rsync
pkgcore.sync.svn
pkgcore.system
pkgcore.system.libtool
pkgcore.util
pkgcore.util.commandline
pkgcore.util.file_type
pkgcore.util.packages
pkgcore.util.parserestrict
pkgcore.util.repo_utils
pkgcore.vdb
pkgcore.vdb.contents
pkgcore.vdb.ondisk
pkgcore.vdb.repo_ops
pkgcore.vdb.virtuals
pkgcore.version
CHAPTER 2
Man Pages
Pkgcore installs a set of scripts for installing and removing packages and for performing various system maintenance operations. The man pages for each command follow.
2.1 Installed Commands
CHAPTER 3
Developer Notes
These are the original docs written for pkgcore, detailing some of its architecture, intentions, and the reasons behind certain designs.
Currently, the docs aren’t accurate; this will be corrected moving forward.
Right now they’re primarily useful from a background-info standpoint.
3.1 Content
3.1.1 Rough TODO
• rip out use.* code from pkgcore_checks.addons.UseAddon.__init__, and generalize it into pkgcore.ebuild.repository
• not hugely important, but... make a cpython version of SlottedDict from pkgcore.util.obj; 3% reduction for full
repo walk, thus not a real huge concern atm.
• userpriv for pebuild misbehaves.
• check into http://aspn.activestate.com/ASPN/Cookbook/Python/Recipe/491285; probably better than my crufty itersort. Need to see how well heapq’s nlargest pop behaves (looks funky)
• look into converting MULTILIB_STRICT* crap over to a trigger
• install-sources trigger
• recreate verify-rdepends also
• observer objects for reporting back events from merging/unmerging; a cpython ‘tee’ is needed, contact harring for details. A basic form of it is in now, but something more powerful is needed for parallelization; elog is bound to this also
• Possibly convert to cpython:
– flat_hash.database._parse_data
– metadata.database._parse_data
– posixpath (os.path)
• get the tree clean of direct /var/db/pkg access
• vdb2 format (ask harring for details).
• pkgcore.fs.ops.merge_contents doesn’t rewrite the contents set when a file it’s merging relies on symlinked directories for the full path; e.g., with /usr/share/X11/xkb/compiled -> /var/blah, it records the former instead of recording the true absolute path.
• pmerge mods: [--skip-set SET], [--skip atom]; use a restriction similar to --replace to prefer vdb for matching atoms
• refactor pkgcore.ebuild.cpv.ver_cmp usage to avoid full cpv parsing when _cpv is in use; ‘nuff said, look in
pkgcore.ebuild.cpv.cpy_ver_cmp
• testing of fakeroot integration
it was working back in the ebd branch days; things have changed since then (heavily), enabling/disabling should
work fine, but will need to take a look at the contentset generation to ensure perms/gid leaks through correctly.
• modify repository.prototype.tree.match to take an optional comparison
reasoning being that if we’re just going to do a max, pass in the max so it has the option of doing the initial
sorting without passing through visibility filters (which will trigger metadata lookups)
• ‘app bundles’. Reliant on serious overhauling of deps to do ‘locked deps’, but think of it as rpath based app
stacks, a full apache stack compiled to run from /opt/blah for example.
• pkgcore.ebuild.gpgtree
derivative of pkgcore.ebuild.ebuild_repository, this overloads ebuild_factory and eclass_cache so that gpg
checks are done. This requires some hackery, partially dependent on config.central changes (see above). Need
a way to specify the trust ring to use, ‘severity’ level (different class targets works for me). Anyone who implements this deserves massive cookies.
• pkgcore.ebuild.gpgprofile: Same as above.
• reintroduce locking of certain high level components using read/write; mainly, use it as a way to block sync’ing
a repo that’s being used to build, lock the vdb for updates, etc.
• preserve xattrs when merging files to properly support hardened profiles
• support standard emerge.log output so tools such as qlop work properly
• add FEATURES=parallel-fetch support for downloading distfiles in the background while building pkgs, possibly extend to support parallel downloads
• apply repo masks to related binpkgs (or handle masks somehow)
• remove deprecated PROVIDE and old style virtuals handling
• add argparse support for checking the inputted phase name with pebuild to make sure it exists; currently nonexistent input causes unhandled exceptions
• allow pebuild to be passed ebuild file paths in addition to its current atom handling, this should work similar to
how portage’s ebuild command operates
• support repos.conf (SYNC is now deprecated)
• make profile defaults (LDFLAGS) override global settings from /usr/share/portage/config/make.globals or similar, then apply user settings on top; currently LDFLAGS is explicitly set to an empty string in make.globals but the profile settings aren’t overriding that
• support /etc/portage/mirrors
• support ACCEPT_PROPERTIES and /etc/portage/package.properties
• support ACCEPT_RESTRICT and /etc/portage/package.accept_restrict
• support pmerge --info (emerge --info workalike); requires support for info_vars and info_pkgs files from profiles
3.1.2 Changes
(Note that this is not a complete list)
• Proper env saving/reloading. The ebuild is sourced once, and run from the env.
• DISTDIR has indirection now. It points at a directory of symlinks to the files. The reason for this is to prevent builds from lying about their sources, leading to fewer bugs.
• PORTAGE_TMPDIR is no longer in the ebuild env.
• (PORTAGE_|)BUILDDIR is no longer in the ebuild env.
• BUILDPREFIX is no longer in the ebuild env.
• AA is no longer in the ebuild env.
• inherit is an error in phases except for setup, prerm, and postrm. Pre/post rm are allowed only in order to deal with broken envs. Running config with a broken env isn’t allowed, because config won’t work; installing with a broken env is not allowed because preinst/postinst won’t be executed.
• binpkg building now gets the unmodified contents; thus when merging a binpkg, all files are there unmodified.
3.1.3 Commandline framework
Overview
pkgcore’s own commandline tools and ideally also most external tools use a couple of utilities from pkgcore.util.commandline to enforce a consistent interface and reduce boilerplate. There are also some helpers for writing
tests for scripts using the utilities. Finally, pkgcore’s own scripts are started through a single wrapper (just to reduce
boilerplate).
Writing a script
Whether your script is intended for inclusion with pkgcore itself or not, the first things you should write are a commandline.OptionParser subclass (unless your script takes no commandline arguments) and a main function. The OptionParser is a lightly customized optparse.OptionParser, so the standard optparse documentation applies. Differences
include:
• A couple of standard options and defaults are added. Some of this is done in __init__, so if you override that (which you will) remember to call the base class (with any keyword arguments you received).
• The “Values” object used is a subclass, with a “config” property that autoloads the user’s configuration. You
should access this as late as possible for a more responsive ui.
• check_values applies some minor cleanups, see the module for details. Remember to call the base method (you
will usually want to do some things here).
The “main” function takes an optparse “values” object generated by your option parser and two pkgcore.util.formatters.Formatter instances, one for stdout and one for stderr. This function should do the actual work your script does.
The return value of the main function is your script’s exit status. Returning None is the same thing as returning 0
(success).
If you have used optparse before you might wonder why main only receives an optparse values object, not the remaining arguments. This is handled a bit differently in pkgcore: if you handle arguments you should sanity-check them
in check_values and store them on the values object. check_values should always return an empty tuple as second
argument, either because no arguments were passed or because they were all accepted by check_values. We believe
this makes more sense, since it stores everything learned from the commandline on a single object.
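The shape described above can be sketched as follows. This is an illustrative stand-in built on the stdlib optparse module, not pkgcore's actual API; the real base classes in pkgcore.util.commandline additionally add the config property, standard options, and formatter handling.

```python
import optparse

class OptionParser(optparse.OptionParser):
    """Hypothetical stand-in for a commandline.OptionParser subclass."""

    def __init__(self, **kwargs):
        optparse.OptionParser.__init__(self, **kwargs)
        self.add_option('--verbose', action='store_true', default=False)

    def check_values(self, values, args):
        values, args = optparse.OptionParser.check_values(
            self, values, args)
        # Sanity-check positional arguments here, store them on the
        # values object, and always return an empty tuple of leftovers.
        values.targets = args
        return values, ()

def main(options, out=None, err=None):
    # The real main receives two Formatter instances for stdout and
    # stderr; this sketch just computes the exit status.
    if not options.targets:
        return 1
    return 0  # returning None would also mean success
```

Parsing `--verbose spork` leaves no leftover arguments and makes the target list available on the values object, which is exactly what main consumes.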
All output has to go through the formatter. If you use “print” directly the formatter will lose track of where it is in
the line, which will cause weird output if you use the “wrap” option of the formatter. The test helpers also rely on all
output going through the formatters.
To actually run your script you call pkgcore.util.commandline.main (do not confuse this with your own script’s main
function, the two are quite different). The simplest (and most common) call is commandline.main({None:
(yourscript.OptionParser, yourscript.main)}). The weird dict is used for subcommands. The
recommended place to put this call is in a tiny script that just imports your actual script module and calls commandline.main. Making your script an actual module you can import means it can be tested (and it can be useful in
interactive python or for quick hacky scripts).
commandline.main takes care of a couple of things, including setting up a reporter for the standard library’s logging
package and swallowing exceptions from the configuration system. It does not swallow any other exceptions your
script might raise (although this might become an option in the future).
check_values and main: what goes where
The idea (as you can guess from the names) is that check_values makes sure everything passed on the commandline
makes sense, but no more than that.
• The best way to report incorrect commandline parameters is by calling error("error message goes
here") on the option parser. You cannot do this from main, since it has no access to the option parser. Please
do not try to print something similar through the err formatter here, shift the code to check_values.
• check_values does not have access to the out or err formatter. The only way it should “communicate” is through
the error (or possibly exit) methods. If you want to produce different kinds of output, do it in main. (it is possible
the option parser will grow a warning method at some point, if this would be useful let us know (file a trac
ticket).
• Use common sense. If it is part of your script’s main task it should be in main. If it changes the filesystem it
should definitely be in main.
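For example, rejecting missing arguments from check_values might look like this. It is a sketch on top of stdlib optparse, and the requirement of "at least one atom" is a made-up example, not a pkgcore rule:

```python
import optparse

class OptionParser(optparse.OptionParser):
    def check_values(self, values, args):
        if not args:
            # error() prints the message along with the usage text and
            # exits; it is the right place to reject bad input.
            self.error('need at least one atom')
        values.atoms = args
        return values, ()
```

Note that error() raises SystemExit, so nothing after it runs.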
Subcommands
The main function recently gained some support for subcommands (which you probably know from most version
control systems). If you find yourself trying to reimplement this kind of interface with optparse, or one similar to
emerge with a couple of mutually exclusive switches selecting a mode (--depclean, --sync, etc.), then you should try
using this subcommand system instead.
To use it, simply define a separate OptionParser and main function for every subcommand and use the subcommand
name as the key in the dict passed to commandline.main. The key None used for “no subcommand” can still be used
too, but this is probably not a good idea.
If there is no parser/main pair with the key None and an unrecognized subcommand is passed (including --help)
an overview of subcommands is printed. This uses the docstring of the __main__ function, so put something useful
there. If there is a None parser you should include the valid subcommands in its --help output, since there is no
way to get at commandline.main’s autogenerated subcommand help if a None parser is present.
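The dispatch behaviour described above can be approximated in a few lines. This is a simplified stand-in for what commandline.main does with the dict, not the real implementation, and the sync/query commands are invented for illustration:

```python
import optparse

def dispatch(commands, argv):
    """Pick a (parser, main) pair by subcommand name and run it."""
    if argv and argv[0] in commands:
        parser_class, main_func = commands[argv[0]]
        argv = argv[1:]
    elif None in commands:
        parser_class, main_func = commands[None]
    else:
        # No None fallback and no recognized subcommand: print an
        # overview (here just the names) and bail out.
        raise SystemExit('valid subcommands: %s'
                         % ', '.join(sorted(k for k in commands if k)))
    values, args = parser_class().parse_args(argv)
    result = main_func(values)
    return 0 if result is None else result

def sync_main(options):
    return 0

def query_main(options):
    return 2

commands = {
    'sync': (optparse.OptionParser, sync_main),
    'query': (optparse.OptionParser, query_main),
}
```

With no None key, an unknown subcommand falls through to the overview branch, matching the behaviour described above.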
pwrapper
Because having a dozen different scripts each just calling commandline.main would be silly, pkgcore’s own scripts are all symlinks to a single wrapper which imports the right actual script based on the sys.argv[0] it is called with
and runs it. The script module needs to define either a commandline_commands dict (for a script with subcommands)
or a class called OptionParser and function called main for this to work.
The script used in the source tree also takes care of inserting the right pkgcore package on sys.path. Installed pkgcore
uses a different wrapper without this magic.
If you write a new script that should go into pkgcore itself, use the wrapper. If you maintain it externally and do not
have a lot of scripts, don’t bother duplicating this wrapper system. Don’t bother duplicating the path manipulation
either: if you put your script in the same directory your actual package or module is in (no separate “bin” directory) and don’t run it as root, no path manipulation is required.
Tests
Because additions to the default options pkgcore uses can make your script unrunnable, it is critical to have at least rudimentary tests that just instantiate your parser. Because optparse defaults to calling sys.exit on a parse failure and the pkgcore version also likes to load the user’s configuration files, writing those tests is slightly tricky.
pkgcore.test.scripts.helpers tries to make it easier. It contains a mangle_parser function that takes an
OptionParser instance and makes it raise exceptions instead of exiting. It also contains a mixin with some extra assert methods that check if your option parser and main function have the desired effect on various arguments and
configurations. See the docstrings for more information.
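The core trick of mangle_parser can be sketched as follows. This is a toy version built on stdlib optparse to show the idea; the real helper in pkgcore.test.scripts.helpers may differ in detail:

```python
import optparse

def mangle_parser(parser):
    """Make an optparse parser raise instead of calling sys.exit."""
    class ParseError(Exception):
        pass

    def error(msg):
        raise ParseError(msg)

    # optparse funnels all parse failures through parser.error(), so
    # replacing it turns exits into catchable exceptions.
    parser.error = error
    parser.parse_error = ParseError
    return parser
```

A test can then feed the parser bad input and assert on the raised exception instead of fighting SystemExit.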
3.1.4 Config use and implementation notes
Using the manager
Normal use
To get at the user’s configuration:
from pkgcore.config import load_config
config = load_config()
main_repo = config.get_default('repo')
spork_repo = config.repo['spork']
Usually this is everything you need to know about the manager. Some things to be aware of:
• Some of the managed sources of configuration data may be slow, so accessing a source is delayed for as long as
possible. Some things require accessing all sources though and should therefore be avoided. The easiest one to
trigger is config.repo.keys() or the equivalent list(config.sections(‘repo’)). This has to get the “class” setting for
every available config section, which might be slow.
• For the same reason the manager does not know what type names exist (there is no hardcoded list of them, so
the only way to get that information would be loading all config sections). This is why you can get this:
>>> load_config().section('repo') # typo, should be "sections"
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
TypeError: '_ConfigMapping' object is not callable
This constructed a dictlike object for accessing all config sections of the type “section”, then tried to call it.
Testcase use
For testing of high-level scripts it can be useful to construct a config manager containing hardcoded values:
from pkgcore.config import basics, central
config = central.ConfigManager([{
    'repo': basics.HardCodedConfigSection({'class': my_repo,
                                           'data': ['1', '2']}),
    'cont': basics.ConfigSectionFromStringDict({'class': 'pkgcore.my.cont',
                                                'ref': 'repo'}),
    }])
What this does should be fairly obvious. Be careful you do not use the same ConfigSection object in more than one
place: caching will not behave the way you want. See Adding a config source for details.
Adding a configurable
You often do not really have to do anything to make something a valid “class” value, but it is clearer and it is necessary
in certain cases.
Adding a class
To make a class available, do this:
from pkgcore.config import ConfigHint, errors

class MyRepo(object):
    pkgcore_config_type = ConfigHint({'cache': 'section_ref'},
                                     typename='repo')

    def __init__(self, repo):
        try:
            self.initialize(repo)
        except SomeRandomException:
            raise errors.InstantiationError('eep!')
The first ConfigHint arg tells the config system what kinds of arguments you take. Without it, the system assumes arguments with no default are strings and guesses for the other args based on the type of the default value. So if you have no default values, or they are just None, you should tell the system about your args.
The second one tells it you fulfill the repo “protocol”, meaning your instances will show up in load_config().repo.
ConfigHint takes some more arguments, see the api docs for details.
Adding a callable
To make a callable available you can do this:
from pkgcore.config import configurable, errors

@configurable({'cache': 'section_ref'}, typename='repo')
def my_repo(repo):
    # do stuff
configurable is just a convenience function that applies a ConfigHint.
Exception handling
If you raise an exception when the config system calls you it will catch the exception and wrap it in an InstantiationError. This is good for calling code since catching and printing those provides the user with a readable description of
what happened. It is less good for developers since the raising of a new exception kills the traceback printed in debug
mode. You will have a traceback that “ends” in the config code handling instantiation.
You can improve this by raising an InstantiationError yourself. If you do this the config system will be able to add the
extra information needed for a user-friendly error message to it without raising a new exception, meaning debug mode
will give a traceback leading right back to your code raising the InstantiationError.
Adding a config source
Config sources are pretty straightforward: they are mappings from a section name to a ConfigSection subclass. The
only tricky thing is the combination of section references and caching. The general rule is “do not expose the same
ConfigSection in more than one way”. If you do it will be collapsed and instantiated once for every way it is exposed,
which is usually not what you want. An example:
from pkgcore.config import basics, configurable

def example():
    return object()

@configurable({'ref': 'section_ref'})
def nested(ref):
    return ref

multi = basics.HardCodedConfigSection({'class': example})
myconf = {
    'multi': multi,
    'bad': basics.HardCodedConfigSection({'class': nested, 'ref': multi}),
    'good': basics.ConfigSectionFromStringDict({'class': 'nested',
                                                'ref': 'multi'}),
    }
If you feed this to the ConfigManager and instantiate everything “multi” and “good” will be identical but “bad” will
be a different object. For an explanation of why this happens see the implementation notes in the next section.
You trigger a similar problem if you create a custom ConfigSection subclass that bypasses central’s collapse_named_section for named section refs. If you somehow get at the referenced ConfigSection and hand it to
collapse_section you will most likely circumvent caching. Only use collapse_section for unnamed sections.
ConfigManager tries not to extract more things from this mapping than it has to. Specifically, it will not call
__getitem__ before it needs to instantiate the section or needs to know its type. However it will iterate over the
keys (section names) immediately to find autoloads. If this is a problem (getting those names is slow) then make sure
the manager knows your config is “remote”.
Implementation notes
This code has evolved quite a bit over time. The current code/design tries among other things to:
• Allow sections to contain both named and nameless/inline references to other sections.
• Allow serialization of the loaded config.
• Not do unnecessary work (if possible, not recollapse configs; definitely not trigger unnecessary imports, access configs unnecessarily, or reinstantiate configs)
• Provide both end-user error messages that are complete enough to track down a problem in a complex nested
config and tracebacks that reach back to actual buggy code for developers.
Overview from load_config() to instantiated repo
When you call load_config() it looks up what config files are available (/etc/pkgcore.conf, ~/.pkgcore.conf,
/etc/make.conf) and loads them. This produces a dict mapping section names to ConfigSection instances. For
the ini-format pkgcore.conf files this is straightforward, for make.conf this is a lot of work done in pkgcore.config.portage_conf. I’m not going to describe that module here, read the source for details.
The ConfigSections have a pretty straightforward api: they work like dicts but get passed a string describing what
“type” the value should be and a central.ConfigManager instance for reasons described later. Passing in this “type”
string when getting the value is necessary because the way things like lists of strings are stored depends on the format
of the configuration file but the parser does not have enough information to know it should parse as a list instead of a
string. For example, an ini-format pkgcore.conf could contain:
[my-overlay-cache]
class=pkgcore.cache.flat_hash.database
auxdbkeys=DEPEND RDEPEND
We want to turn that auxdbkeys value into a list of strings in the ini file parser code instead of in the central.ConfigManager or even higher up, because more exotic config sections may want to store this in a different way (perhaps as a comma-separated list, or even as “<el>DEPEND</el><el>RDEPEND</el>”). But there is obviously not enough information in the ini file for the parser to know this is meant as a list instead of a string with a space in it.
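The idea can be condensed to a toy example (purely illustrative; the real ConfigSection.get_value takes more arguments): the stored text stays the same, and the requested type drives the parsing.

```python
def get_value(raw, value_type):
    """Toy type-driven parsing for an ini-style config section."""
    if value_type == 'list':
        # The ini format stores lists as whitespace-separated words;
        # other formats are free to store them differently.
        return raw.split()
    if value_type == 'bool':
        return raw.lower() in ('yes', 'true', '1')
    return raw
```

The same raw string 'DEPEND RDEPEND' comes back as a list only when the caller asks for one.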
central.ConfigManager gets instantiated with one or more of those dicts mapping section names to ConfigSections.
They’re split up into normal and “remote” configs which I’ll describe later, let’s assume they’re all “remote” for now.
In that case no work is done when the ConfigManager is instantiated.
Getting an actual configured object out of the ConfigManager is split in two phases. First the involved config sections
are “collapsed”: inherits are processed, values are converted to the right type, presence of required arguments is
checked, etc. Everything up to actually instantiating the target class and actually instantiating any section references
it needs. The result of this work is bundled in a CollapsedConfig instance. Actual instantiation is handled by the
CollapsedConfig instance.
The ConfigManager manages CollapsedConfig instances. It creates new ones if required and makes sure that if a
cached instance is available it is used.
For the remainder of the example let’s assume our config looks like this:
[spork]
inherit=cache
auxdbkeys=DEPEND RDEPEND
[cache]
class=pkgcore.cache.flat_hash.database
Running config.repo[’spork’] runs config.collapse_named_section(‘spork’). This first checks if this section was already collapsed and returns the CollapsedConfig if it is available. If it is not in the cache it looks up the ConfigSection
with that name in the dicts handed to the ConfigManager on instantiation and calls collapse_section on it.
collapse_section first recursively finds any inherited sections (just the “cache” section in this case). It then grabs
the ‘class’ setting (which is always of type ‘callable’). In this case that’s “pkgcore.cache.flat_hash.database”,
which the ConfigSection imports and returns. This is then wrapped in a config.basics.ConfigType. A ConfigType contains the information necessary to validate arguments passed to the callable. It uses the magic
pkgcore_config_type attribute if the callable has it and introspection for everything else. In this case pkgcore.cache.flat_hash.database.pkgcore_config_type is a ConfigHint stating the “auxdbkeys” argument is of type “list”.
Now that collapse_section has a ConfigType it uses it to retrieve the arguments from the ConfigSections and passes
the ConfigType and arguments to CollapsedConfig’s __init__. Then it returns the CollapsedConfig instance to collapse_named_section. collapse_named_section caches it and returns it.
Now we’re back in the __getattr__ triggered by config.repo[’spork’]. This checks if the ConfigType on the CollapsedConfig is actually ‘repo’, and returns collapsedConfig.instantiate() if this matches.
Lazy section references
The main reason the above is so complicated is to support various kinds of references to other sections. Example:
[spork]
class=pkgcore.Spork
ref=foon
[foon]
class=pkgcore.Foon
Let’s say pkgcore.Spork has a ConfigHint stating the type of the “ref” argument is “lazy_ref:foon” (lazy reference
to a foon) and its typename is “repo”, and pkgcore.Foon has a ConfigHint stating its typename is “foon”. A “lazy reference” is an instance of basics.LazySectionRef, which is an object containing just enough information to produce
a CollapsedConfig instance. This is not the most common kind of reference, but it is simpler from the config point of
view so I’m describing this one first.
When collapse_section runs on the “spork” section it calls section.get_value(self, ‘ref:repo’, ‘section_ref’). “lazy_ref”
in the type hint is converted to just “ref” here because the ConfigSections do not have to distinguish between lazy
and “normal” references. Because this particular ConfigSection only supports named references it returns a LazyNamedSectionRef(central, ‘ref:repo’, ‘foon’). This just gets handed to Spork’s __init__. If the Spork decides to call
instantiate() on the LazyNamedSectionRef it calls central.collapse_named_section(‘foon’), checks if the result is of
type foon, instantiates it and returns it.
The same thing using a dhcp-style config:
spork {
class pkgcore.Spork;
ref {
class pkgcore.Foon;
};
}
In this format the reference is an inline unnamed section. When get_value(central, ‘ref:repo’, ‘foon’) is called it
returns a LazyUnnamedSectionRef(central, ‘ref:repo’, section) where section is a ConfigSection instance for the nested
section (knowing just that “class” is “pkgcore.Foon” in this case). This is handed to Spork.__init__. If Spork calls
instantiate() on it it calls central.collapse_section(self.section) and does the same type checking and instantiating
LazyNamedSectionRef did.
Notice neither Spork nor ConfigManager care if the reference is inline or named. get_value just has to return a
LazySectionRef instance (LazyUnnamedSectionRef and LazyNamedSectionRef are subclasses of this). How this
actually gets a referenced config section is up to the ConfigSection whose get_value gets called.
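A toy model of the lazy-reference idea may help. The real class lives in pkgcore.config.basics; the stubbed central object below is hypothetical and only shows the collapse-then-instantiate dance described above:

```python
class LazyNamedSectionRef(object):
    """Just enough state to produce the referenced object on demand."""

    def __init__(self, central, typename, name):
        self.central = central
        self.typename = typename
        self.name = name

    def instantiate(self):
        # Resolve the named section through central so caching works;
        # the real code also checks the collapsed section's type.
        collapsed = self.central.collapse_named_section(self.name)
        return collapsed.instantiate()
```

Nothing is imported or instantiated until some consumer actually calls instantiate(), which is the whole point of the laziness.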
Normal section references
If Spork’s ConfigHint defines the type of its “ref” argument as “ref:foon” instead of “lazy_ref:foon” it gets handed
an actual Foon instance instead of a LazySectionRef to one. This is built on top of the lazy reference code. For
the ConfigSections nothing changes (the same get_value call is made). But the ConfigManager now immediately
calls collapse() on the LazySectionRef, retrieving a CollapsedConfig instance (for the “foon”). This is handed to the
CollapsedConfig for “spork”, and when this one is instantiated the referenced CollapsedConfig is also instantiated.
Miscellaneous details
The support for nameless sections means neither ConfigSection nor CollapsedConfig have a name attribute. This
makes the error handling code a bit tricky as it has to tag in the name at various points, but this works better than
enforcing names where it does not make sense (means lots of unnecessary duplication of names when dealing with
dicts of HardCoded/StringBasedConfigSections).
The support for serialization of the loaded config means section_refs cannot be instantiated straight away. The object used for serialization is the CollapsedConfig, which gives you both the actual values and the types they have. If the CollapsedConfig contained arbitrary instantiated objects serializing them would be impossible. So it contains nested CollapsedConfigs instead.
Unnecessary work is avoided by caching in two places. The simple one is CollapsedConfig caching its instantiated value, which is straightforward. The more subtle one is the ConfigManager caching CollapsedConfigs by name. It is obviously a good idea to cache these (if we did not, we would have to cache the instantiated value in the ConfigManager). An alternative would be caching them by their ConfigSection. This has the minor disadvantage of keeping the ConfigSection in memory, and the larger one that it may break caching for unusual config sources that generate ConfigSections on demand. The downside of caching by name is that we have to make sure nothing generates a CollapsedConfig for a named section in a way other than collapse_named_section (handing the ConfigSection to collapse_section bypasses caching).
This means a ConfigSection cannot return a raw ConfigSection from a section_ref get_value call. If it returned a ConfigSection that central then collapsed, and the reference was actually to a named section, caching would be bypassed.
The need for a section name starting with “autoload” is also there to avoid unnecessary work. Without this we would
have to figure out the typename of every section. While we can do that without entirely collapsing the config we
cannot avoid importing the “class”, which means load_config() would import most of pkgcore. That should definitely
be avoided.
3.1.5 Checking the source out
If you’re just installing pkgcore from a released tarball, skip this section.
To get the current (development) code with history, install git (emerge git on Gentoo) and run:
git clone git://pkgcore.org/pkgcore
3.1.6 Installing pkgcore
Set PYTHONPATH
If you only want to run scripts from pkgcore itself (the ones in its “bin” directory) you do not have to do anything
with PYTHONPATH. If you want to use pkgcore from an interactive python interpreter session you do not have to do
anything if you start the interpreter from the “root” of the pkgcore source tree. For other uses you probably want to set
PYTHONPATH to include your pkgcore directory, so that python can find the pkgcore code. For example:
$ export PYTHONPATH="${PYTHONPATH}:/home/user/pkgcore/"
Now test to see if it works:
$ python -c 'import pkgcore'
Python will scan pkgcore, see the pkgcore directory in it (and that it has __init__.py), and use that.
Registering plugins
Pkgcore uses plugins for some basic functionality. You do not really have to do anything to get this working, but things
are a bit faster if the plugin cache is up to date. This happens automatically if the cache is stale and the user running
pkgcore may write there, but if pkgcore is installed somewhere system-wide and you only run it as user you can force
a regeneration with:
# pplugincache
If you want to update plugin caches for something other than pkgcore’s core plugin registry, pass the package name as
an argument.
Test pkgcore
Drop back to normal user, and try:
$ python
>>> import pkgcore.config
>>> from pkgcore.ebuild.atom import atom
>>> conf=pkgcore.config.load_config()
>>> tree=conf.get_default('domain').repos[1]
>>> pkg=max(tree.itermatch(atom("dev-util/diffball")))
>>> print pkg
>>> print pkg.depends
>=dev-libs/openssl-0.9.6j >=sys-libs/zlib-1.1.4 >=app-arch/bzip2-1.0.2
At the time of writing the domain interface is in flux, so this example might fail for you. If it doesn't work, ask for
assistance in #pkgcore on freenode, or email ferringb (at) gmail.com with the traceback.
Build extensions
If you want to run pkgcore from its source directory but also want the extra speed from the compiled extension
modules, compile them in place:
$ python setup.py build_ext -i
3.1.7 Ebuild EAPI
This should hold the proposed (with a chance of making it in), accepted, and implemented changes for ebuild format
version 1. A version 0 doc would also be a good idea (no one has volunteered thus far).
Version 0 (or undefined EAPI, <=portage-2.0.52*)
Version 1
This should be fairly easy stuff to implement for the package manager, so this can actually happen in a fairly short
timeframe.
• EAPI = 1 required
• src_configure phase is run before src_compile. If the ebuild or eclass does not override there is a default that does
nothing. Things like econf should be run in this phase, allowing rerunning the build phase without rerunning
configure during development.
• Make the default implementation of phases/functions available under a second name (possibly using EXPORT_FUNCTIONS) so you can call base_src_compile from your src_compile.
• default src_install. Exactly what goes in needs to be figured out, see bug 33544.
• RDEPEND="${RDEPEND-${DEPEND}}" is no longer set by portage, same for eclass.
• (proposed) BDEPEND metadata addition, maybe. These are the dependencies that are run on the build system
(toolchain, autotools etc). Useful for ROOT != "/". Probably hard to get right for ebuild devs who always have
ROOT="/".
• default IUSE support, IUSE="+gcj" == USE="gcj" unless the user disables it.
• GLEP 37 (“Virtuals Deprecation”), maybe. The glep is “deferred”. How much of this actually needs to be done?
package.preferred?
• test depend, test src_uri (or represent test in the use namespace somehow).
TEST_{SRC_URI,{B,R,}DEPEND}, test “USE” flag getting set by FEATURES=test.
Possibilities:
• drop AA (unused).
• represent in metadata if the pkg needs pkg_preinst to have access to ${D} or not. If this is not required a binpkg
can be unpacked straight to root after pkg_preinst. If pkg_preinst needs access to ${D} the binpkg is unpacked
there as usual.
• use groups in some form (kill use_expand off).
• ebuilds can no longer use PORTDIR and ECLASSDIR(s); they break any potential remote setup, and are dodgy as
all hell for multiple repos combined together.
• disallow direct access to /var/db/pkg
• deprecate ebuild access/awareness of PORTAGE_* vars; perl ebuilds security fix for PORTAGE_TMPDIR
(rpath stripping in a way) might make this harder.
• use/slot deps, optionally repository deps.
• hard one to slide in, but change versioning rules; no longer allow 1.006, require it to be 1.6
• pkg_setup must be sandboxable.
• allowed USE conditional configurations; new metadata key, extend depset syntax to include xor, represent
allowed configurations.
• true incremental stacking support for metadata keys between eclasses/ebuilds; RESTRICT=-strip for example
in the ebuild.
• drop -* from keywords; it’s package.masking, use that instead (-arch is acceptable although daft)
• blockers aren’t allowed in PDEPEND (the result of that is serious insanity for resolving)
Version 1+
Not sure about these. Maybe some can go into version 1, maybe they will happen later.
• Elibs
• some way to ‘bind’ a rdep/pdep so that it’s explicit “I’m locked against the version I was compiled against”
• some form of optional metadata specifying that a binpkg works on multiple arches, iow it doesn’t rely on
compiled components.
• A way to move svn/cvs/etc source fetching over to the package manager. The current way of doing this through
an eclass is a bit ugly since it requires write access to the distdir. Moving it to the package manager fixes that
and allows integrating it with things like parallel fetch. This needs to be fleshed out.
3.1.8 Feature (FEATURES) categories
relevant list of features
• autoaddcvs
• buildpkg
• ccache
• collision-protect
• confcache
• cvs
• digest
• distcc
• distlocks
• fixpackages
• getbinpkg
• gpg
• keeptemp
• keepwork
• mirror
• noclean (keeptemp, keepwork)
• nodoc
• noinfo
• noman
• nostrip
• notitles
• sandbox
• severe
• severer (dumb spanky)
• sfperms
• sign
• strict
• suidctl
• test
• userpriv
• userpriv_fakeroot
• usersandbox
Undefined
fixpackages
Dead
• usersandbox
• noclean
• getbinpkg (it’s a repo type, not a global feature)
• buildpkg (again, a repo thing; more so ui/buildplan execution)
Build
• keeptemp, keepwork, noclean, ccache, distcc
• sandbox, userpriv, fakeroot
• userpriv_fakeroot becomes fakeroot
• confcache
• noauto (fun one)
• test
repos or wrappers
Mutables
• autoaddcvs
• cvs
• digest
• gpg
• no{doc,info,man,strip}
• sign
• sfperms
• collision-protect (vdb only)
Immutables
• strict
• severe; these two are repository opts on the gpg repo class
Fetchers
• distlocks, sort of.
3.1.9 Filesystem Operations
Here we define types of operations that pkgcore will support, as well as the stages where these operations occur.
- File Deletion ( Removal )
• prerm
• unmerge files
• postrm
- File Addition ( Installation )
• preinst
• merge files
• postinst
- File Replacement ( Overwriting )
• preinst
• merge
• postinst
• prerm
• unmerge
• postrm
3.1.10 Python Code Guidelines
Note that not all of the existing code follows this style guide. This doesn’t mean existing code is correct.
Stats are all from a Sempron 1.6GHz with Python 2.4.2.
Finally, code _should_ be documented, following epydoc/epytext guidelines
Follow pep8, with following exemptions
• <80 char limit is only applicable where it doesn’t make the logic ugly. This is not an excuse to have a 200 char
if statement (fix your logic). Use common sense.
• Combining imports is ok.
• Use absolute imports
• _Simple_ try/except combined lines are acceptable, but not forced (this is your call). example:
try: l.remove(blah)
except ValueError: pass
• For comments, 2 spaces trailing is pointless- not needed.
• Classes should be named SomeClass, functions/methods should be named some_func.
• Exceptions are classes. Don’t raise strings.
• Avoid __var ‘private’ attributes unless you absolutely have a reason to hide it, and the class won’t be inherited
(or that attribute must _not_ be accessed)
• Using string module functions when you could use a string method is evil. Don’t do it.
• Use isinstance(str_instance, basestring) unless you _really_ need to know if it’s utf8/ascii
Throw self with a NotImplementedError
The reason for this is simple: if you just throw a NotImplementedError, you can’t tell how the path was hit if derivative
classes are involved; thus throw NotImplementedError(self, string_name_of_attr)
This gives far better tracebacks.
Be aware of what the interpreter is actually doing
Don’t use len(list_instance) when you just want to know if it’s nonempty/empty:
l=[1]
if l: blah
# instead of
if len(l): blah
Python looks for __nonzero__, then __len__. It's far faster than if you try to be explicit there:
python -m timeit -s 'l=[]' 'if len(l) > 0: pass'
1000000 loops, best of 3: 0.705 usec per loop
python -m timeit -s 'l=[]' 'if len(l): pass'
1000000 loops, best of 3: 0.689 usec per loop
python -m timeit -s 'l=[]' 'if l: pass'
1000000 loops, best of 3: 0.302 usec per loop
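The lookup order can be seen directly with a class that defines only __len__ (the doc predates python 3, where __nonzero__ was renamed __bool__; this sketch uses a hypothetical Box class and runs under python 3):

```python
class Box:
    """Container defining only __len__; truth testing falls back to it."""

    def __init__(self, items):
        self.items = list(items)

    def __len__(self):
        return len(self.items)

# No __bool__ (nee __nonzero__) defined, so `if box:` consults __len__.
assert not Box([])      # len() == 0 -> falsy
assert Box([1, 2])      # len() > 0 -> truthy
```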
Don't explicitly use has_key. Rely on the 'in' operator
python -m timeit -s 'd=dict(zip(range(1000), range(1000)))' 'd.has_key(1999999)'
1000000 loops, best of 3: 0.512 usec per loop
python -m timeit -s 'd=dict(zip(range(1000), range(1000)))' '1999999 in d'
1000000 loops, best of 3: 0.279 usec per loop
Python implements the 'in' operator by calling __contains__ on the instance. The interpreter is faster at doing attribute lookups than actual python code is: the code above uses d.__contains__, and if you call d.has_key or d.__contains__ explicitly it's the same speed; using 'in' is faster because the interpreter does the lookup itself.
So be aware of how the interpreter will execute your code. Attribute access spelled out in python code is slower than the interpreter doing it on its own.
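A hypothetical class makes the protocol concrete: "x in obj" dispatches straight to __contains__, with no attribute lookup spelled out in python code:

```python
class EvenNumbers:
    """Pretend 'collection' of all even integers; purely illustrative."""

    def __contains__(self, n):
        return n % 2 == 0

evens = EvenNumbers()
assert 4 in evens        # interpreter calls evens.__contains__(4)
assert 5 not in evens
```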
If you’re in doubt, python -m timeit is your friend. ;-)
Do not use [] or {} as default args in function/method definitions
>>> def f(default=[]):
...     default.append(1)
...     return default
>>> print f()
[1]
>>> print f()
[1, 1]
When the function/class/method is defined, the default args are instantiated _then_, not per call. The end result is
that if a default arg is mutable, you should use None as the default and test for it being None; this is exempted if you
_know_ the code doesn't mangle the default.
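The standard fix is the None-sentinel idiom; a minimal sketch (append_one is a made-up name):

```python
def append_one(default=None):
    # None sentinel: a fresh list is created per call instead of
    # one shared list being created at definition time.
    if default is None:
        default = []
    default.append(1)
    return default

assert append_one() == [1]
assert append_one() == [1]        # no state leaks between calls
assert append_one([0]) == [0, 1]  # caller-supplied list still works
```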
Visible curried functions should have documentation
When using the currying methods (pkgcore.util.currying) for function mangling, preserve the documentation via
pretty_docs.
If this is exempted, pydoc output for objects isn’t incredibly useful.
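pkgcore's pretty_docs serves the same purpose as the stdlib's functools.wraps, which copies __doc__ and __name__ onto the wrapper; a sketch of the general idea (curry here is a toy, not pkgcore.util.currying's actual implementation):

```python
import functools

def curry(func, *frozen):
    """Toy partial application that keeps the original's docs visible."""
    @functools.wraps(func)   # copies __doc__, __name__, etc. onto wrapped
    def wrapped(*args):
        return func(*frozen, *args)
    return wrapped

def add(a, b):
    """Add two numbers."""
    return a + b

add_two = curry(add, 2)
assert add_two(3) == 5
assert add_two.__doc__ == "Add two numbers."   # pydoc output stays useful
assert add_two.__name__ == "add"
```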
Unit testing
All code _should_ have test case functionality. We use twisted.trial - you should be running >=2.2 (<2.2 results in
false positives in the spawn tests). Regressions should be test cased, exempting idiot mistakes (e.g., typos).
We are more than willing to look at code that lacks tests, but actually merging the code to integration requires that it
has tests.
One area that is (at the moment) exempted from this is the ebuild interaction; testing that interface is extremely hard,
although it _does_ need to be implemented.
If tests are missing from code (due to tests not being written initially), new tests are always desired.
If it's FS related code, it's _usually_ cheaper to try than to ask then try
...but you should verify it ;)
existing file (but empty to avoid reading overhead):
echo > dar
python -m timeit -s 'import os' 'os.path.exists("dar") and open("dar").read()'
10000 loops, best of 3: 36.4 usec per loop
python -m timeit -s 'import os' $'try: open("dar").read()\nexcept IOError: pass'
10000 loops, best of 3: 22 usec per loop
nonexistent file:
rm foo
python -m timeit -s 'import os' 'os.path.exists("foo") and open("foo").read()'
10000 loops, best of 3: 29.8 usec per loop
python -m timeit -s 'import os' $'try: open("foo").read()\nexcept IOError: pass'
10000 loops, best of 3: 27.7 usec per loop
As you can see, there is a bit of a difference. :)
Note that this was qualified with "if it's FS related code"; syscalls are not cheap. If the code is not triggering syscalls,
the next section is relevant.
Catching exceptions in python code (rather than cpython) isn't cheap
stats from python-2.4.2
When an exception is caught:
python -m timeit -s 'd=dict(zip(range(1000), range(1000)))' $'try: d[1999]\nexcept KeyError: pass'
100000 loops, best of 3: 8.7 usec per loop
python -m timeit -s 'd=dict(zip(range(1000), range(1000)))' '1999 in d and d[1999]'
1000000 loops, best of 3: 0.492 usec per loop
When no exception is caught, overhead of try/except setup:
python -m timeit -s 'd=dict(zip(range(1000), range(1000)))' $'try: d[0]\nexcept KeyError: pass'
1000000 loops, best of 3: 0.532 usec per loop
python -m timeit -s 'd=dict(zip(range(1000), range(1000)))' 'd[0]'
1000000 loops, best of 3: 0.407 usec per loop
This doesn’t advocate writing code that doesn’t protect itself- just be aware of what the code is actually doing, and be
aware that exceptions in python code are costly due to the machinery involved.
Another example is when to use or not to use dict’s setdefault or get methods:
key exists:
# Through exception handling
python -m timeit -s 'd=dict.fromkeys(range(100))' 'try: x=d[1]' 'except KeyError: x=42'
1000000 loops, best of 3: 0.548 usec per loop
# d.get
python -m timeit -s 'd=dict.fromkeys(range(100))' 'x=d.get(1, 42)'
1000000 loops, best of 3: 1.01 usec per loop
key doesn't exist:
# Through exception handling
python -m timeit -s 'd=dict.fromkeys(range(100))' 'try: x=d[101]' 'except KeyError: x=42'
100000 loops, best of 3: 8.8 usec per loop
# d.get
python -m timeit -s 'd=dict.fromkeys(range(100))' 'x=d.get(101, 42)'
1000000 loops, best of 3: 1.05 usec per loop
The short version: if you know the key is there, dict.get() is slower; if you don't, get is your friend. In other words,
use it instead of doing a containment test and then accessing the key.
Of course this only considers the case where the default value is simple. If it's something more costly, "except" will
do relatively better since it does not construct the default value unless it is needed. So if in doubt in a performance-critical
piece of code: benchmark parts of it with timeit instead of assuming "exceptions are slow" or "[] is fast".
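A common place where get() pays off is counting, where the key is usually absent the first time it is seen; a minimal sketch (the words here are made up for illustration):

```python
counts = {}
for word in ["spork", "foon", "spork"]:
    # get() returns the default instead of raising KeyError,
    # so no try/except is needed for the first occurrence.
    counts[word] = counts.get(word, 0) + 1

assert counts == {"spork": 2, "foon": 1}
```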
cpython ‘leaks’ vars into local namespace for certain constructs
def f(s):
    while True:
        try:
            some_func_that_throws_exception()
        except Exception, e:
            # e exists in this namespace now.
            pass
    # some other code here...
From the code above, e bled into the f namespace- that’s referenced memory that isn’t used, and will linger until the
while loop exits.
Python _does_ bleed variables into the local namespace- be aware of this, and explicitly delete references you don’t
need when dealing in large objs, especially dealing with exceptions:
class c:
    d = {}
    for x in range(1000):
        d[x] = x
While the class above is contrived, the thing to note is that c.x is now valid- the x from the for loop bleeds into the
class namespace and stays put.
Don't leave unneeded vars lingering in class namespace.
Variables that leak from for loops _normally_ aren’t an issue, just be aware it does occur- especially if the var is
referencing a large object (thus keeping it in memory).
So... for loops leak, list comps leak, and depending on your except clauses those can leak too.
Do not go overboard with this though. If your function will exit soon do not bother cleaning up variables by hand. If
the “leaking” things are small do not bother either.
The current code deletes exception instances explicitly much more often than it should since this was believed to clean
up the traceback object. This does not work: the only thing “del e” frees up is the exception instance and the arguments
passed to its constructor. “del e” also takes a small amount of time to run (clearing up all locals when the function
exits is faster).
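The loop-variable leak described above is easy to demonstrate; a minimal sketch (note that in python 3 the name bound by "except ... as e" is deleted when the block exits, so only the loop case below still leaks there):

```python
def leaky():
    for x in range(3):
        pass
    # x is still bound here: the loop variable leaked into the
    # function's local namespace and keeps its last value alive
    # until the function returns.
    return x

assert leaky() == 2
```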
Unless you need to generate (and save) a range result, use xrange
$ python -m timeit 'for x in range(10000): pass'
100 loops, best of 3: 2.01 msec per loop
$ python -m timeit 'for x in xrange(10000): pass'
1000 loops, best of 3: 1.69 msec per loop
Removals from a list aren't cheap, especially leftmost
If you _do_ need to do leftmost removals, the deque module is your friend.
Rightmost removals aren’t too cheap either, depending on what idiocy people come up with to try and ‘help’ the
interpreter:
python -m timeit $'l=range(1000);i=0;\nwhile i < len(l):\n\tif l[i]!="asdf": del l[i]\n\telse: i+=1'
100 loops, best of 3: 4.12 msec per loop
python -m timeit $'l=range(1000);\nfor i in xrange(len(l)-1,-1,-1):\n\tif l[i]!="asdf": del l[i]'
100 loops, best of 3: 3 msec per loop
python -m timeit 'l=range(1000);l=[x for x in l if x == "asdf"]'
1000 loops, best of 3: 1 msec per loop
Granted, that’s worst case, but the worst case is usually where people get bitten (note the best case still is faster for list
comprehension).
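The two cheap alternatives above can be sketched together: deque for leftmost removal, a comprehension for bulk filtering (both stdlib, nothing pkgcore-specific assumed):

```python
from collections import deque

# Leftmost removals: deque.popleft() is O(1); list.pop(0) is O(n)
# because every remaining element has to shift left.
q = deque([1, 2, 3])
assert q.popleft() == 1
assert list(q) == [2, 3]

# Bulk filtering: build a new list instead of deleting in place.
l = list(range(10))
l = [x for x in l if x % 2 == 0]
assert l == [0, 2, 4, 6, 8]
```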
On a related note, don’t pop() unless you have a reason to.
If you’re testing for None specifically, be aware of the ‘is’ operator
'is' avoids the equality protocol and does a straight pointer comparison:
python -m timeit '10000000 != None'
1000000 loops, best of 3: 0.721 usec per loop
$ python -m timeit '10000000 is not None'
1000000 loops, best of 3: 0.343 usec per loop
Note that we're specifically forcing a large int; using 1 under 2.5 has the same runtime. The reason is that comparison
defaults to an identity check before the actual comparison; for small ints python uses singletons, so the identity check
kicks in.
Deprecated/crappy modules
• Don't use the types module. Use isinstance (this isn't a speed reason, types sucks).
• Don't use the string module. There are exceptions, but use string methods when available.
• Don't use the stat module just to get a stat attribute, e.g.:

  import stat
  l = os.stat("asdf")[stat.ST_MODE]
  # can be done as (and is a bit cleaner):
  l = os.stat("asdf").st_mode
Know the exceptions that are thrown, and catch just those you’re interested in
try:
    blah
except Exception:
    blah2

There is a major issue here: before python 2.5 this also catches KeyboardInterrupt (triggered by Ctrl+c) and
SystemExit; meaning this code, which was just bad exception handling, now swallows Ctrl+c (and thus screws with
UI code).
Catch only what you're interested in.
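A sketch of the narrow-catch style in modern python (read_file and the path are made up; OSError covers the old IOError):

```python
def read_file(path):
    """Return the file's contents, or None if it can't be read."""
    try:
        with open(path) as f:
            return f.read()
    except OSError:
        # Only I/O failures are expected here; KeyboardInterrupt,
        # SystemExit etc. propagate untouched.
        return None

assert read_file("/nonexistent/hopefully/missing") is None
```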
tuples versus lists.
The former is immutable, while the latter is mutable.
Lists over-allocate (a cpython thing), meaning they take up more memory than is used (this is usually actually a good
thing).
If you’re generating/storing a lot of sequences that shouldn’t be modified, use tuples. They’re cheaper in memory, and
people can reference the tuple directly without being concerned about it being mutated elsewhere.
However, using lists there would require each consumer to copy the list to protect themselves from mutation. So...
over-allocation + allocating a new list for each consumer.
Bad, mm’kay.
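The memory difference is easy to check with sys.getsizeof; exact numbers vary by interpreter build, so only the ordering is asserted in this sketch:

```python
import sys

t = (1, 2, 3)
l = [1, 2, 3]
# The tuple stores its items in a fixed-size block with no
# over-allocation, so it is smaller than the equivalent list.
assert sys.getsizeof(t) < sys.getsizeof(l)
```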
Don’t try to copy immutable instances (e.g. tuples/strings)
Example: copy.copy((1,2,3)) is dumb; nobody makes a mistake that obvious, but in larger code people do (people even
try using [:] to copy a string; it returns the same string since it’s immutable).
You can’t modify them, therefore there is no point in trying to make copies of them.
__del__ methods mess with garbage collection
__del__ methods have the annoying side affect of blocking garbage collection when that instance is involved in a
cycle- basically, the interpreter doesn’t know what __del__ is going to reference, so it’s unknowable (general case)
how to break the cycle.
So... if you’re using __del__ methods, make sure the instance doesn’t wind up in a cycle (whether careful data structs,
or weakref usage).
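The weakref route can be sketched like this (Node and the parent/child link are hypothetical; the immediate collection on "del" relies on cpython's reference counting):

```python
import weakref

class Node:
    pass

parent = Node()
child = Node()
# Weak link upward: no reference cycle is formed, so __del__-style
# collection concerns never arise.
child.parent = weakref.ref(parent)

assert child.parent() is parent
del parent
# On cpython the refcount hits zero immediately and the weakref
# goes dead, returning None.
assert child.parent() is None
```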
A general point: python isn’t slow, your algorithm is
l = []
for x in data_generator():
    if x not in l:
        l.append(x)
That code is _best_ case O(N) overall (e.g., the generator yielding all 0's, so each membership test is O(1)); the worst case is O(N^2).
l = set()
for x in data_generator():
    if x not in l:
        l.add(x)
Best and worst case are now both linear (this isn't strictly true due to the potential internal expansion of the set, but
that's ignorable in this case).
Furthermore, the first loop actually invokes the __eq__ protocol for x for each element, which can potentially be quite
slow if dealing in complex objs.
The second loop invokes __hash__ once on x instead (technically the set implementation may invoke __eq__ if a
collision occurs, but that’s an implementation detail).
Technically, the second loop is still a bit inefficient:
l=set(data_generator())
is simpler and faster.
Example data for people who don't see how _bad_ this can get:
python -m timeit $'l=[]\nfor x in xrange(1000):\n\tif x not in l: l.append(x)'
10 loops, best of 3: 74.4 msec per loop
python -m timeit $'l=set()\nfor x in xrange(1000):\n\tif x not in l: l.add(x)'
1000 loops, best of 3: 1.24 msec per loop
python -m timeit 'l=set(xrange(1000))'
1000 loops, best of 3: 278 usec per loop
The difference here is obvious.
This does _not_ mean sets are automatically better everywhere; just be aware of what you're doing. For a single
search of a small sequence, the set's setup overhead is far slower than a linear search. That's the nature of sets: while
the implementation may be able to guess the proper size, it still has to add each item in; if it cannot guess the size
(i.e. no size hint: a generator, iterator, etc.), it has to keep adding items, expanding the set as needed (which requires
linear walks for each expansion). While this may seem obvious, people sometimes effectively do the following:
python -m timeit -s 'l=range(50)' 'if 1001 in set(l): pass'
100000 loops, best of 3: 12.2 usec per loop
python -m timeit -s 'l=range(50)' 'if 1001 in l: pass'
10000 loops, best of 3: 7.68 usec per loop
What’s up with __hash__ and dicts
A bunch of things (too many things most likely) in the codebase define __hash__. The rule for __hash__ is (quoted
from http://docs.python.org/ref/customization.html):
Should return a 32-bit integer usable as a hash value for dictionary operations. The only required property
is that objects which compare equal have the same hash value.
Here’s a quick rough explanation for people who do not know how a “dict” works internally:
• Things added to it are dumped in a “bucket” depending on their hash value.
• To check if something is in the dict it first determines the bucket to check (based on hash value), then does
equality checks (__cmp__ or __eq__ if there is one, otherwise object identity comparison) for everything in the
bucket (if there is anything).
So what does this mean?
• There’s no reason at all to define your own __hash__ unless you also define __eq__ or __cmp__. The behaviour
of your object in dicts/sets will not change, it will just be slower (since your own __hash__ is almost certainly
slower than the default one).
• If you define __eq__ or __cmp__ and want your object to be usable in a dict you have to define __hash__. If
you don’t the default __hash__ is used which means your objects act in dicts like only object identity matters
until you hit a hash collision and your own __eq__ or __cmp__ kicks in.
• If you do define your own __hash__ it has to produce the same value for objects that compare equal, or you get
really weird behaviour in dicts/sets (“thing in dict” returning False because the hash values differ while “thing
in dict.keys()” returns True because that does not use the hash value, only equality checks).
• If the hash value changes after the object was put in a dict you get weird behaviour too (“s=set([thing]);
thing.change_hash();thing in s” is False, but “thing in list(s)” is True). So if your objects are mutable they
can usually provide __eq__/__cmp__ but not __hash__.
• Not having many hash “collisions” (same hash value for objects that compare nonequal) is good, but collisions
are not illegal. Too many of them just slow down dict/set operations (in a worst case scenario of the same hash
for every object dict/set operations become linear searches through the single hash bucket everything ends up
in).
• If you use the hash value directly keep in mind that collisions are legal. Do not use comparisons of hash values
as a substitute for comparing objects (implementing __eq__ / __cmp__). Probably the only legitimate use of
hash() is to determine an object’s hash value based on things used for comparison.
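A sketch of that one legitimate pattern: a hash built from exactly the things __eq__ compares (Point is a hypothetical class; note that modern python 3 makes a class unhashable if it defines __eq__ without __hash__, which enforces the rule described above):

```python
class Point:
    def __init__(self, x, y):
        self.x, self.y = x, y

    def __eq__(self, other):
        return (self.x, self.y) == (other.x, other.y)

    def __hash__(self):
        # Must agree with __eq__: objects that compare equal
        # produce the same hash value.
        return hash((self.x, self.y))

assert Point(1, 2) == Point(1, 2)
assert Point(1, 2) in {Point(1, 2)}   # bucket lookup, then __eq__
assert hash(Point(1, 2)) == hash(Point(1, 2))
```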
__eq__ and __ne__
From http://docs.python.org/ref/customization.html:
There are no implied relationships among the comparison operators. The truth of x==y does not imply that
x!=y is false. Accordingly, when defining __eq__(), one should also define __ne__() so that the operators
will behave as expected.
They really mean that. If you define __eq__ but not __ne__, doing "!=" on instances compares them by identity. This
is surprisingly easy to miss, especially since the natural way to write unit tests for classes with custom comparisons
goes like this:
self.assertEqual(YourClass(1), YourClass(1))
# Repeat for more possible values. Uses == and therefore __eq__,
# behaves as expected.
self.assertNotEqual(YourClass(1), YourClass(2))
# Repeat for more possible values. Uses != and therefore object
# identity, so they all pass (all different instances)!
So you end up only testing __eq__ on equal values (it can say “identical” for different values without you noticing).
Adding a __ne__ that just does “return not self == other” fixes this.
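The fix can be sketched as follows (Version is a made-up class; python 3 later made != fall back to the negation of __eq__ automatically, but code targeting python 2 needs the explicit __ne__):

```python
class Version:
    def __init__(self, n):
        self.n = n

    def __eq__(self, other):
        return self.n == other.n

    def __ne__(self, other):
        # Delegate to __eq__ so the two can never disagree.
        return not self == other

assert Version(1) == Version(1)
assert Version(1) != Version(2)
assert not (Version(1) != Version(1))   # no identity fallback
```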
__eq__/__hash__ and subclassing
If your class has a custom __eq__ and it might be subclassed you have to be very careful about how you “compare” to
instances of a subclass. Usually you will want to be “different” from those unconditionally:
def __eq__(self, other):
    if self.__class__ is not YourClass or other.__class__ is not YourClass:
        return False
    # Your actual code goes here
This might seem like overkill, but it is necessary to avoid problems if you are subclassed and the subclass does not
have a new __eq__. If you just do an “isinstance(other, self.__class__)” check you will compare equal to instances of a
subclass, which is usually not what you want. If you just check for “self.__class__ is other.__class__” then subclasses
that add a new attribute without overriding __eq__ will compare equal when they should not (because the new attribute
differs).
If you subclass something that has an __eq__ you should most likely override it (you might get away with not doing
so if the class does not do the type check demonstrated above). If you add a new attribute don’t forget to override
__hash__ too (that is not critical, but you will have unnecessary hash collisions if you forget it).
This is especially important for pkgcore because of pkgcore.util.caching. If an instance of a class with a broken __eq__
is used as argument for the __init__ of a class that uses caching.WeakInstMeta it will cause a cached instance to be
used when it should not. Notice the class with the broken __eq__ does not have to be cached itself to trigger this!
Getting this wrong can cause fun behaviour like atoms showing up in the list of fetchables because the restrictions
they’re in compare equal independent of their “payload”.
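The strict check above can be sketched with made-up Base/Derived classes:

```python
class Base:
    def __init__(self, a):
        self.a = a

    def __eq__(self, other):
        # Compare equal only when both sides are exactly Base.
        if type(self) is not Base or type(other) is not Base:
            return False
        return self.a == other.a

class Derived(Base):
    def __init__(self, a, b):
        super().__init__(a)
        self.b = b   # new attribute Base.__eq__ knows nothing about

assert Base(1) == Base(1)
assert Base(1) != Derived(1, 2)   # subclass never compares equal
assert Derived(1, 2) != Base(1)
```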
Exception subclassing
It is pretty common for an Exception subclass to want to customize the return value of str() on an instance. The easiest
way to do that is:
class MyException(Exception):
    """Describe when it is raised here."""

    def __init__(self, stuff):
        Exception.__init__(self, 'MyException because of %s' % (stuff,))
This is usually easier than defining a custom __str__ (since you do not have to store the value of “stuff” as an attribute)
and you should be calling the base class __init__ anyway.
(This does not mean you should never store things like “stuff” as attrs: it can be very useful for code catching the
exception to have access to it. Use common sense.)
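A sketch combining both points: the formatted message passed to the base __init__, plus the raw value kept as an attribute (FetchError and its argument are made up for illustration):

```python
class FetchError(Exception):
    """Raised when retrieving a file fails."""

    def __init__(self, url):
        # Formatted message passed up: str(exc) is useful without
        # defining a custom __str__.
        Exception.__init__(self, "failed fetching %s" % (url,))
        # Raw value kept for code catching the exception.
        self.url = url

err = FetchError("http://example.invalid/distfile")
assert str(err) == "failed fetching http://example.invalid/distfile"
assert err.url == "http://example.invalid/distfile"
```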
Memory debugging
Heapy and dowser are the two currently recommended tools.
To use dowser, insert the following into the code wherever you'd like to check the heap (note this call blocks):

import cherrypy
import dowser

cherrypy.config.update({'engine.autoreload_on': False})
try:
    cherrypy.quickstart(dowser.Root())
except AttributeError:
    cherrypy.root = dowser.Root()
    cherrypy.server.start()
For using heapy, see the heapy documentation in pkgcore/dev-notes.
3.1.11 resolver
Current design doesn't coalesce; it expects that each atom, as it's passed in, specifies the dbs, which is how it does its
update/empty-tree trickery.
This isn’t optimal. Need to flag specific atoms/matches as “upgrade if possible” or “empty tree if possible”, etc;
via this, we get coalescing behaviour. Specifically, if the targets are git[subversion] and subversion, we want both
upgraded. So when resolving git[subversion] and encountering dev-util/subversion, we should aim for upgrading it
per the commandline request.
An additional question: should we apply this coalescing awareness to intermediate atoms along the way, resolution-wise?
Specifically, for the cnf/dnf solutions, grabbing those and stating "yeah, collapse to these if possible since they're likely
required"?
3.1.12 resolver redesign
Hate to say it, but should go back to a specific ‘resolve’ method w/ the resolver plan object holding targets- reason
being, we may have to backtrack the whole way.
3.1.13 config/use issues
Need to find a way to clone a stack, getting a standalone config stack if possible for the resolver, specifically so it
can do resets as needed and track what is involved (use dep forcing) without influencing preexisting access to that tree,
nor being affected by said usage.
3.1.14 hardlink merge
no comments, just need to get around to it.
3.1.15 How to use guppy/heapy for tracking down memory usage
This is a work in progress. It will grow a bit and it may not be entirely accurate everywhere.
Tutorial of sorts
All this was done on a checkout of [email protected]; you should be able to
check that out and follow along using something like:
bzr revert -rrevid:[email protected]
in a pkgcore branch.
Heapy is powerful but has a learning curve. The problems are that the documentation
(http://guppype.sourceforge.net/heapy_Use.html among others) is a bit unusual, and that there are various dynamic
importing and other tricks in use that make things like dir() less helpful than they are on more "normal" python objects.
This document's main purpose is to show you how to ask heapy various kinds of questions. It may or may not show a
few cases where pkgcore uses more memory than it should, too.
First, get an x86. Heapy currently does not like 64 bit archs much.
Emerge it:
emerge guppy
Fire up an interactive python prompt, set stuff up:
>>> from guppy import hpy
>>> from pkgcore.config import load_config
>>> c = load_config()
>>> hp = hpy()
Just to show how annoying heapy’s internal tricks are:
>>> dir(hp)
['__doc__', '__getattr__', '__init__', '__module__', '__setattr__', '_hiding_tag_', '_import', '_name
>>> help(hp)
Help on class _GLUECLAMP_ in module guppy.etc.Glue:
_GLUECLAMP_ = <guppy.heapy.Use interface at 0x-484b8554>
This object is your “starting point”, but as you can see the underlying machinery is not giving away any useful usage
instructions.
Do everything that allocates some memory but is not the problem you are tracking down now. Then do:
>>> hp.setrelheap()
Everything allocated before this call will not be in the data sets you get later.
Now do your memory-intensive thing:
>>> l = list(x for x in c.repo["portdir"] if x.data)
Keep an eye on system memory consumption. You want to use up a lot but not all of your system ram for nicer
statistics. The python process was eating about 109M res in top when the above stuff finished, which is pretty good
(for my 512mb ram box).
>>> h = hp.heap()
The fun one. This object is basically a snapshot of what’s reachable in ram (minus the stuff excluded through setrelheap
earlier) which you can do various fun tricks with. Its str() is a summary:
>>> h
Partition of a set of 1449133 objects. Total size = 102766644 bytes.
 Index  Count   %     Size   % Cumulative  % Kind (class / dict of class)
     0 985931  68 46300932  45  46300932  45 str
     1  24681   2 22311624  22  68612556  67 dict of pkgcore.ebuild.ebuild_src.package
     2  49391   3 21311864  21  89924420  88 dict (no owner)
     3 115974   8  3776948   4  93701368  91 tuple
     4 152181  11  3043616   3  96744984  94 long
     5  36009   2  1584396   2  98329380  96 weakref.KeyedRef
     6  11328   1  1540608   1  99869988  97 dict of pkgcore.ebuild.ebuild_src.ThrowAwayNameSpace
     7  24702   2   889272   1 100759260  98 types.MethodType
     8  11424   1   851840   1 101611100  99 list
     9  24681   2   691068   1 102302168 100 pkgcore.ebuild.ebuild_src.package
<54 more rows. Type e.g. '_.more' to view.>
(You might want to keep an eye on ram usage: heapy made the process grow another dozen mb here. It gets painfully
slow if it starts swapping, so if that happens reduce your data set).
Notice the “Total size” in the top right: about 100M. That’s what we need to compare later numbers with.
So here we can see that (surprise!) we have a ton of strings in memory. We also have various kinds of dicts. Dicts are
treated a bit specially: the “dict of pkgcore.ebuild.ebuild_src.package” simply means “all the dicts that are __dict__
attributes of instances of that class”. “dict (no owner)” are all the dicts that are not used as __dict__ attribute.
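To make the "dict of class" versus "dict (no owner)" distinction concrete, here is a small sketch (the class and attribute names are invented for illustration, not pkgcore code):

```python
# An instance's __dict__ is what heapy reports as "dict of <class>";
# a dict that is merely stored *inside* some attribute (not used as a
# __dict__) shows up as "dict (no owner)".
class Package(object):
    pass

pkg = Package()
pkg.data = {'DESCRIPTION': 'example'}

owned = pkg.__dict__      # heapy would classify this as "dict of Package"
unowned = pkg.data        # heapy would classify this as "dict (no owner)"

print(owned is vars(pkg))           # True
print(isinstance(unowned, dict))    # True
```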
You probably guessed what you can use “index” for:
>>> h[0]
Partition of a set of 985931 objects. Total size = 46300932 bytes.
 Index  Count   %     Size   % Cumulative   % Kind (class / dict of class)
     0 985931 100 46300932 100   46300932 100 str
Ok, that looks pretty useless, but it really is not. The “sets” heapy gives you (like “h” and “h[0]”) are a bunch of
objects, grouped together by an “equivalence relation”. The default one (with the crazy name “Clodo” for “Class or
dict owner”) groups together all objects of the same class and dicts with the same owner. We can also partition the sets
by a different equivalence relation. Let’s do a silly example first:
>>> h.bytype
Partition of a set of 1449133 objects. Total size = 102766644 bytes.
 Index  Count   %     Size   % Cumulative  % Type
     0 985931  68 46300932  45  46300932  45 str
     1  85556   6 45226592  44  91527524  89 dict
     2 115974   8  3776948   4  95304472  93 tuple
     3 152181  11  3043616   3  98348088  96 long
     4  36009   2  1584396   2  99932484  97 weakref.KeyedRef
     5  24702   2   889272   1 100821756  98 types.MethodType
     6  11424   1   851840   1 101673596  99 list
     7  24681   2   691068   1 102364664 100 pkgcore.ebuild.ebuild_src.package
     8  11328   1   317184   0 102681848 100 pkgcore.ebuild.ebuild_src.ThrowAwayNameSpace
     9    408   0    26112   0 102707960 100 types.CodeType
<32 more rows. Type e.g. '_.more' to view.>
As you can see this is the same thing as the default view, but with all the dicts lumped together. A more useful one is:
>>> h.byrcs
Partition of a set of 1449133 objects. Total size = 102766644 bytes.
 Index  Count   %     Size   % Cumulative  % Referrers by Kind (class / dict of class)
     0 870779  60 43608088  42  43608088  42 dict (no owner)
     1  24681   2 22311624  22  65919712  64 pkgcore.ebuild.ebuild_src.package
     2 221936  15 20575932  20  86495644  84 dict of pkgcore.ebuild.ebuild_src.package
     3 242236  17  8588560   8  95084204  93 tuple
     4      6   0  1966736   2  97050940  94 dict of weakref.WeakValueDictionary
     5  36009   2  1773024   2  98823964  96 dict (no owner), dict of
                                             pkgcore.ebuild.ebuild_src.package, weakref.KeyedRef
     6  11328   1  1540608   1 100364572  98 pkgcore.ebuild.ebuild_src.ThrowAwayNameSpace
     7  26483   2   800432   1 101165004  98 list
     8  11328   1   724992   1 101889996  99 dict of pkgcore.ebuild.ebuild_src.ThrowAwayNameSpace
     9      3   0   393444   0 102283440 100 dict of pkgcore.repository.prototype.IterValLazyDict
<132 more rows. Type e.g. '_.more' to view.>
What this does is:
• for every object, find all its referrers
• Classify those referrers using the “Clodo” relation you saw earlier
• Create a set of those classifiers of referrers. That means a set of things like “tuple, dict of someclass”, not of
actual referring objects.
• Group together all the objects with the same set of classifiers of referrers.
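The grouping described by those steps can be sketched in plain Python (a toy version, not heapy's implementation; the objects and their referrer map are invented):

```python
# Toy version of the byrcs idea: classify each object by the *set of
# kinds* of the things referring to it, then bucket objects that share
# the same referrer-kind set.
from collections import defaultdict

# object -> the containers that refer to it (hypothetical data)
referrers = {
    'a': [{'k': 'a'}],            # referenced only by a dict
    'b': [{'k': 'b'}, ('b',)],    # referenced by a dict and a tuple
    'c': [('c',)],                # referenced only by a tuple
    'd': [{'k': 'd'}, ('d',)],    # same referrer kinds as 'b'
}

groups = defaultdict(list)
for obj, refs in referrers.items():
    kinds = frozenset(type(r).__name__ for r in refs)
    groups[kinds].append(obj)

for kinds in sorted(groups, key=lambda k: sorted(k)):
    print(sorted(kinds), sorted(groups[kinds]))
```

Note that 'b' and 'd' land in the same bucket even though their actual referrers differ: only the *classifiers* of the referrers matter, which is exactly why byrcs rows read like "dict (no owner), weakref.KeyedRef".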
So now we know that we have a lot of objects referenced only by one or more dicts (still not very useful) and also a
lot of them referenced by one “normal” dict, referenced by the dict of (meaning “an attribute of”) ebuild_src.package,
and referenced by a WeakRef. Hmm, I wonder what those are. But let’s store this view of the data first, since it took a
while to generate (“_” is a feature of the python interpreter, it’s always the last result):
>>> byrcs = _
>>> byrcs[5]
Partition of a set of 36009 objects. Total size = 1773024 bytes.
 Index Count   %    Size   % Cumulative   % Referrers by Kind (class / dict of class)
     0 36009 100 1773024 100    1773024 100 dict (no owner), dict of
                                            pkgcore.ebuild.ebuild_src.package, weakref.KeyedRef
Erm, yes, we knew that already. If you look in the top right of the table you can see it is still grouping the items by the
kind of their referrer, which is not very useful here. To get more information we can change what they are grouped by:
>>> byrcs[5].byclodo
Partition of a set of 36009 objects. Total size = 1773024 bytes.
 Index Count   %    Size   % Cumulative   % Kind (class / dict of class)
     0 36009 100 1773024 100    1773024 100 str
>>> byrcs[5].bysize
Partition of a set of 36009 objects. Total size = 1773024 bytes.
 Index Count  %   Size  % Cumulative   % Individual Size
     0 10190 28 489120 28     489120  28 48
     1  7584 21 394368 22     883488  50 52
     2  7335 20 322740 18    1206228  68 44
     3  3947 11 221032 12    1427260  80 56
     4  3364  9 134560  8    1561820  88 40
     5  1903  5 114180  6    1676000  95 60
     6   877  2  56128  3    1732128  98 64
     7   285  1  19380  1    1751508  99 68
     8   451  1  16236  1    1767744 100 36
     9    57  0   4104  0    1771848 100 72
This took the set of objects with that odd set of referrers and redisplayed them grouped by “clodo”. So now we know
they’re all strings. Most of them are pretty small too. To get some idea of what we’re dealing with we can pull some
random examples out:
>>> byrcs[5].byid
Set of 36009 <str> objects. Total size = 1773024 bytes.
 Index     Size   %   Cumulative  %   Representation (limited)
     0       80  0.0          80  0.0 'media-plugin...re20051219-r1'
     1       76  0.0         156  0.0 'app-emulatio...4.20041102-r1'
     2       76  0.0         232  0.0 'dev-php5/ezc...hemaTiein-1.0'
     3       76  0.0         308  0.0 'games-misc/f...wski-20030120'
     4       76  0.0         384  0.0 'mail-client/...pt-viewer-0.8'
     5       76  0.0         460  0.0 'media-fonts/...-100dpi-1.0.0'
     6       76  0.0         536  0.0 'media-plugin...gdemux-0.10.4'
     7       76  0.0         612  0.0 'media-plugin...3_pre20051219'
     8       76  0.0         688  0.0 'media-plugin...3_pre20051219'
     9       76  0.0         764  0.0 'media-plugin...3_pre20060502'
>>> byrcs[5].byid[0].theone
'media-plugins/vdr-streamdev-server-0.3.3_pre20051219-r1'
A pattern emerges! (sets with one item have a “theone” attribute with the actual item, all sets have a “nodes” attribute
that returns an iterator yielding the items).
We could have used another heapy trick to get a better idea of what kind of string this was:
>>> byrcs[5].byvia
Partition of a set of 36009 objects. Total size = 1773024 bytes.
 Index Count % Size % Cumulative % Referred Via:
     0     1 0   80 0         80 0 "['cpvstr']", '.key', '.keys()[23147]'
     1     1 0   76 0        156 0 "['cpvstr']", '.key', '.keys()[12285]'
     2     1 0   76 0        232 0 "['cpvstr']", '.key', '.keys()[12286]'
     3     1 0   76 0        308 0 "['cpvstr']", '.key', '.keys()[16327]'
     4     1 0   76 0        384 0 "['cpvstr']", '.key', '.keys()[17754]'
     5     1 0   76 0        460 0 "['cpvstr']", '.key', '.keys()[19079]'
     6     1 0   76 0        536 0 "['cpvstr']", '.key', '.keys()[21704]'
     7     1 0   76 0        612 0 "['cpvstr']", '.key', '.keys()[23473]'
     8     1 0   76 0        688 0 "['cpvstr']", '.key', '.keys()[24239]'
     9     1 0   76 0        764 0 "['cpvstr']", '.key', '.keys()[3070]'
<35999 more rows. Type e.g. '_.more' to view.>
Ouch, 36009 total rows for 36009 objects. What this did is similar to what “byrcs” did: for every object in the set it
determined how they can be reached through their referrers, then groups objects that can be reached in the same ways
together. Unfortunately it is grouping everything reachable as a dictionary key differently, so this is not very useful.
XXX WTF XXX
It is not likely this accomplishes anything, but let’s assume we want to know if there are any objects in this set not
reachable as the “key” attribute. Heapy can tell us (although this is very slow... there might be a better way but I do
not know it yet):
>>> nonkeys = byrcs[5] & hp.Via('.key').alt('<')
>>> nonkeys.byrcs
hp.Nothing
(remember “hp” was our main entrance into heapy, the object that gave us the set of all objects we’re interested in
earlier).
What does this do? “hp.Via(‘.key’)” creates a “symbolic set” of “all objects reachable only as the ‘key’ attribute of
something” (it’s a “symbolic set” because there are no actual objects in it). The “alt” method gives us a new symbolic
set of everything reachable via “less than” this way. We then intersect this with our set and discover there is nothing
left.
A similar construct that does not do what we want is:
>>> nonkeys = byrcs[5] & ~hp.Via('.key')
The “~” operator inverts the symbolic set, giving a set matching everything not reachable exactly as a “key” attribute.
The key word here is “exactly”: since everything in our set was also reachable in two other ways this intersection
matches everything.
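A plain-set analogy may make the difference clearer (ordinary Python sets standing in for heapy's symbolic sets; the reachability data here is invented):

```python
# ~Via('.key') only excludes objects reachable *exactly* as '.key' and
# nothing else; alt('<') instead matches objects whose ways of being
# reached are a subset of the given ways.
reachable_via = {
    'x': {'.key'},                   # reachable only as a .key attribute
    'y': {'.key', "['cpvstr']"},     # reachable two different ways
}

# like some_set & ~hp.Via('.key'): drop things reachable exactly as .key
not_exactly_key = {o for o, ways in reachable_via.items()
                   if ways != {'.key'}}

# like some_set & hp.Via('.key').alt('<'): ways must be a subset of {'.key'}
at_most_key = {o for o, ways in reachable_via.items()
               if ways <= {'.key'}}

print(sorted(not_exactly_key))  # ['y']
print(sorted(at_most_key))      # ['x']
```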
Ok, let’s get back to the stuff actually eating memory:
>>> h[0].byrcs
 Index  Count   %     Size   % Cumulative   % Referrers by Kind (class / dict of class)
     0 670791  68 31716096  68  31716096  68 dict (no owner)
     1 139232  14  6525856  14  38241952  83 tuple
     2 136558  14  6042408  13  44284360  96 dict of pkgcore.ebuild.ebuild_src.package
     3  36009   4  1773024   4  46057384  99 dict (no owner), dict of
                                             pkgcore.ebuild.ebuild_src.package, weakref.KeyedRef
     4   1762   0   107772   0  46165156 100 list
     5    824   0    69476   0  46234632 100 types.CodeType
     6    140   0    31312   0  46265944 100 function, tuple
     7    194   0    11504   0  46277448 100 dict of module
     8     30   0     6284   0  46283732 100 dict of type
     9     55   0     1972   0  46285704 100 dict of module, tuple
Remember h[0] gave us all str objects, so this is all string objects grouped by the kind(s) of their referrers. Also notice
index 3 here is the same set of stuff we saw earlier:
>>> h[0].byrcs[3] ^ byrcs[5]
hp.Nothing
Most operators do what you would expect, & intersects for example.
“We have a lot of strings in dicts” is not that useful either, let’s see if we can narrow that down a little:
>>> h[0].byrcs[0].referrers.byrcs
Partition of a set of 44124 objects. Total size = 18636768 bytes.
 Index Count  %     Size  % Cumulative   % Referrers by Kind (class / dict of class)
     0 24681 56 12834120 69   12834120  69 dict of pkgcore.ebuild.ebuild_src.package
     1 19426 44  5371024 29   18205144  98 dict (no owner)
     2     1  0   393352  2   18598496 100 dict of pkgcore.repository.prototype.IterValLazyDict
     3     1  0     6280  0   18604776 100 __builtin__.set
     4     1  0     6280  0   18611056 100 dict of module, guppy.heapy.heapyc.RootStateType
     5     1  0     6280  0   18617336 100 dict of pkgcore.ebuild.eclass_cache.cache
     6     1  0     6280  0   18623616 100 dict of
                                           pkgcore.repository.prototype.PackageIterValLazyDict
     7     4  0     5536  0   18629152 100 type
     8     4  0     3616  0   18632768 100 dict of type
     9     1  0     1672  0   18634440 100 dict of module, dict of os._Environ
(Broken down: h[0].byrcs[0] is the set of all str objects referenced only by dicts, h[0].byrcs[0].referrers is the set of
those dicts, and the final .byrcs displays those dicts grouped by their referrers)
Keep an eye on the size column. We have over 12M worth of just dicts (not counting the stuff in them) referenced only
as attribute of ebuild_src.package. If we include the stuff kept alive by those dicts we’re talking about a big chunk of
the 100MB total here:
>>> t = _
>>> t[0].domisize
61269552
60M out of our 100M would be deallocated if we killed those dicts. So let’s ask heapy what dicts that are:
>>> t[0].byvia
Partition of a set of 24681 objects. Total size = 12834120 bytes.
 Index Count   %     Size   % Cumulative   % Referred Via:
     0 24681 100 12834120 100   12834120 100 "['data']"
(It is easy to get confused by the "byrcs" view of our "t". t[0] is not a bunch of "dict of ebuild_src.package". It is a
bunch of dicts with strings in them, namely those that are referred to by the dict of ebuild_src.package, and not by
anything else. So the byvia output means those dicts with strings in them are all "data" attributes of ebuild_src.package
instances.)
(Sidenote: earlier we saw byvia say '.key', now it says "['data']". It's different because the previous type used
__slots__ (so there was no "dict of" involved) and this type does not (so there is a "dict of" and our dicts are the "data"
key in it).)
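That __slots__ distinction can be reproduced with a toy pair of classes (a sketch for illustration, not the actual pkgcore types):

```python
# A __slots__ class stores attribute values directly in slots - there is
# no per-instance __dict__, so heapy reaches the value as '.key'.  A
# normal class stores values in the instance __dict__, so heapy reaches
# them as entries of that dict, e.g. "['data']".
class Slotted(object):
    __slots__ = ('key',)

class Plain(object):
    pass

s = Slotted()
s.key = 'value'
p = Plain()
p.data = 'value'

print(hasattr(s, '__dict__'))   # False - no "dict of Slotted" involved
print('data' in p.__dict__)     # True - reached via a __dict__ key
```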
So what is in the dicts:
>>> t[0].referents
Partition of a set of 605577 objects. Total size = 34289392 bytes.
 Index  Count  %     Size  % Cumulative   % Kind (class / dict of class)
     0 556215 92 27710068 81   27710068  81 str
     1  24681  4  6085704 18   33795772  99 dict (no owner)
     2  24681  4   493620  1   34289392 100 long
>>> _.byvia
Partition of a set of 605577 objects. Total size = 34289392 bytes.
 Index Count %    Size  % Cumulative  % Referred Via:
     0 24681 4 6085704 18    6085704 18 "['_eclasses_']"
     1 21954 4 3742976 11    9828680 29 "['DEPEND']"
     2 22511 4 3300052 10   13128732 38 "['RDEPEND']"
     3 24202 4 2631304  8   15760036 46 "['SRC_URI']"
     4 24681 4 1831668  5   17591704 51 "['DESCRIPTION']"
     5 24674 4 1476680  4   19068384 56 "['HOMEPAGE']"
     6 24681 4 1297680  4   20366064 59 "['KEYWORDS']"
     7 24681 4  888516  3   21254580 62 '.keys()[3]'
     8 24681 4  888516  3   22143096 65 '.keys()[9]'
     9 24681 4  810108  2   22953204 67 "['LICENSE']"
<32 more rows. Type e.g. '_.more' to view.>
Strings, nested dicts and longs, with most of the size eaten up by the "_eclasses_" values. There is also a significant
amount eaten up by values reached via .keys(), which is a bit odd, so let's investigate:
>>> refs = t[0].referents
>>> i = iter(refs.byvia[7].nodes)
>>> i.next()
'DESCRIPTION'
>>> i.next()
'DESCRIPTION'
>>> i.next()
'DESCRIPTION'
>>> i.next()
'DESCRIPTION'
>>> i.next()
'DESCRIPTION'
Eep!
>>> refs.byvia[7].bysize
Partition of a set of 24681 objects. Total size = 888516 bytes.
 Index Count   %   Size   % Cumulative   % Individual Size
     0 24681 100 888516 100     888516 100 36
It looks like we have 24681 identical strings here, using up about 1M of memory. The other odd entry is apparently the
'_eclasses_' string.
Extra stuff for c extension developers
To provide accurate statistics if your code uses extension types you must provide heapy with a way to get the following
data for your custom types:
• How large is a certain instance?
• What objects does an instance contain?
• How does the instance refer to a contained object?
You provide these through a NyHeapDef struct, defined in heapdef.h in the guppy source. This header is not installed,
so you should just copy it into your source tree. It is a good idea to read this header file side by side with the following
descriptions, since it contains details omitted here. The stdtypes.c file contains implementations for the basic python
types which you can read for inspiration.
The NyHeapDef struct provides heapy with three function pointers:
SizeGetter
To answer “how large is an instance” you provide a NyHeapDef_SizeGetter function that is called with a PyObject*
and returns an int: the number of bytes the object occupies. If you do not provide this function heapy uses a default that
looks at the tp_basicsize and tp_itemsize fields of the type. This means that if you do not allocate any extra memory
for non-python objects (e.g. for c strings) you do not need to provide this function.
Traverser
To answer “What objects does an instance contain” you provide a traversal function (NyHeapDef_Traverser). This is
called with a pointer to a “visit procedure”, an instance of your extension type and some other stuff. You should then
call the visit procedure for every python object contained in your object.
This might sound familiar: to support the python garbage collector you provide a very similar function (tp_traverse).
Actually heapy will use tp_traverse if you do not provide a heapy-specific traverse function. Doing this makes sense
if you do not support the garbage collector for some reason, or if you contain objects that are irrelevant to the garbage
collector.
An example would be a type that contains a single python string object (that no other code can get a reference to). If
this object does not have references to other python objects it cannot be involved in cycles so supporting gc would be
useless. However you do still want heapy to know about the memory occupied by the contained string. You could do
that by adding that size in your NyHeapDef_SizeGetter function but it is probably easier to tell heapy about the string
through the traversal function (so you do not have to calculate the memory occupied by the string).
If the above type also had a reference to some arbitrary (non-private) python object it should support gc, but
it does not need to tell gc about the contained string. So you would have two traversal functions: one for heapy that
visits the string, and one for gc that does not.
RelationGetter
The last function heapy wants is one that tells it in what way your instance refers to some contained object. It is used
to provide the "byvia" view. Heapy calls this visit function once for each way your instance refers to a target object,
telling it what kind of reference it is.
Providing the heapdef struct to heapy
Once you have the needed function pointers in a struct you need to pass this to heapy somehow. This is done through
a standard cpython mechanism called “cobjects”. From python these look like rather stupid objects you cannot do
anything with, but from c you can pull out a void* that was put in when the object was constructed. You can wrap an
arbitrary pointer in a CObject, make it available as attribute of your module, then import it from some other module,
pull the void* back out and cast it to the original type.
heapy looks for a _NyHeapDefs_ attribute on all loaded modules. If this attribute exists and is a CObject the pointer in
it is used as a pointer to an array of NyHeapDef struct (terminated with a struct with only nulls). Example code doing
this is in sets.c in the guppy source.
3.1.16 Plugins system
Goals
The plugin system (pkgcore.plugin) is used to pick up extra code (potentially distributed separately from pkgcore
itself) at a place where using the config system is not a good idea for some reason. This means that for a lot of things
that most people would call “plugins” you should not actually use pkgcore.plugin, you should use the config
system. Things like extra repository types should simply be used as “class” value in the configuration. The plugin
system is currently mainly used in places where handing in a ConfigManager is too inconvenient.
Using plugins
Plugins are looked up based on a string “key”. You can always look up all available plugins matching this key with
pkgcore.plugin.get_plugins(key). For some kinds of plugin (the ones defining a “priority” attribute) you
can also get the “best” plugin with pkgcore.plugin.get_plugin(key). This does not make sense for all
kinds of plugin, so not all of them define this.
The plugin system does not care about what kind of object plugins are, this depends entirely on the key.
Adding plugins
Basics, caching
Plugins for pkgcore are loaded from modules inside the pkgcore.plugins package. This package has some magic
to make plugins in any subdirectory pkgcore/plugins under a directory on sys.path work. So if pkgcore
itself is installed in site-packages you can still add plugins to /home/you/pythonlib/pkgcore/plugins if
/home/you/pythonlib is in PYTHONPATH. You should not put an __init__.py in this extra plugin directory.
Plugin modules should contain a pkgcore_plugins dictionary that maps the "key" strings to a sequence of plugins.
This dictionary has to be constant, since pkgcore keeps track of what plugin module provides plugins for what keys in
a cache file to avoid unnecessary imports. So this is invalid:
try:
    import spork_package
except ImportError:
    pkgcore_plugins = {}
else:
    pkgcore_plugins = {'myplug': [spork_package.ThePlugin]}
since if the plugin cache is generated while the package is not available pkgcore will cache the module as not providing
any myplug plugins, and the cache will not be updated if the package becomes available (only changes to the mtime
of actual plugin modules invalidate the cache). Instead you should do something like this:
try:
    from spork_package import ThePlugin
except ImportError:
    class ThePlugin(object):
        disabled = True

pkgcore_plugins = {'myplug': [ThePlugin]}
If a plugin has a “disabled” attribute the plugin system will never return it from get_plugin or get_plugins.
Priority
If you want your plugin to support get_plugin it should have a priority attribute: an integer indicating how
“preferred” this plugin is. The plugin with the highest priority (that is not disabled) is returned from get_plugin.
Some types of plugins need more information to determine a priority value. Those should not have a priority attribute.
They should use get_plugins instead and have a method that gets passed the extra data and returns the priority.
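The selection rule can be sketched in a few lines (invented plugin classes; this shows the concept, not pkgcore.plugin's actual code):

```python
# get_plugin conceptually returns the plugin with the highest 'priority'
# attribute among those not marked disabled.
class Low(object):
    priority = 1
    disabled = False

class High(object):
    priority = 10
    disabled = False

class Broken(object):
    priority = 99
    disabled = True     # never returned, despite the highest priority

def best_plugin(plugins):
    enabled = [p for p in plugins if not getattr(p, 'disabled', False)]
    return max(enabled, key=lambda p: p.priority)

print(best_plugin([Low, High, Broken]).__name__)  # High
```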
Import behaviour
Assuming the cache is working correctly (it was generated after installing a plugin as root) pkgcore will import all
plugin modules containing plugins for a requested key in priority order until it hits one that is not disabled. The
“disabled” value is not cached (a plugin that is unconditionally disabled makes no sense), but the priority value is. You
can fake a dynamic priority by having two instances of your plugin registered and only one of them enabled at the
same time.
This means it makes sense to have only one kind of plugin per plugin module (unless the required imports overlap):
this avoids pulling in imports for other kinds of plugin when one kind of plugin is requested.
The disabled value is not cached by the plugin system after the plugin module is imported. This means it should be a
simple attribute (either completely constant or set at import time) or property that does its own caching.
Adding a plugin package
Both get_plugin and get_plugins take a plugin package as second argument. This means you can use the
plugin system for external pkgcore-related tools without cluttering up the main pkgcore plugin directory. If you
do this you will probably want to copy the __path__ trick from pkgcore/plugins/__init__.py to support
plugins elsewhere on sys.path.
3.1.17 Pkgcore/Portage differences
Disclaimer
Pkgcore moves fairly fast in terms of development- we will strive to keep this doc up to date, but it may lag behind the
actual code.
Ebuild environment changes
All changes are either glep33 related, or a tightening of the restrictions on the env to block common snafus that localize
the ebuild's environment to that machine.
• portageq based functions are disabled in the global scope.
The reasoning for this is QA: has_version/best_version must not affect the generated metadata. As such, portageq calls in the global scope
are disabled.
• inherit is disabled in all phases but depend and setup. Folks no longer do it, but inherit from within one of the
build/install phases is now actively blocked.
• The ebuild env is now effectively akin to suspending the process and restarting it. Essentially, transitioning
between ebuild phases, the ebuild environment is snapshotted, cleaned of irrelevant data (bash forced vars for
example, or vars that pkgcore sets for the local system on each shift into a phase), and saved. Portage does
this partially (re-execs ebuilds/eclasses, thus stomping the env on each phase change); pkgcore does it fully. As
such, pkgcore is capable of glep33, while portage is not (env fixes are the basis of glep33).
• ebuild.sh now protects itself from basic fiddling. Ebuild generated state must work as long as the EAPI is
the same, regardless of the generating portage version and the portage version that later uses the saved state
(simple example: generated with portage-2.51, if portage 3 is EAPI compliant with that env, it must not allow
its internal bash changes to break the env). As such, certain funcs are not modifiable by the ebuild- namely,
internal portage/pkgcore functionality, hasq/useq for example. Those functions that are read-only also are not
saved in the ebuild env (they should be supplied by the portage/pkgcore instance reloading the env).
• ebuild.sh is daemonized. The upshot of this is that regen is roughly 2x faster (careful reuse of ebuild.sh instances
rather than forcing bash to spawn all over). An additional upshot is that there are bidirectional communication
pipes between ebuild.sh and the python parent- env inspection, logging, and passing requests up to the python side
(has_version/best_version for example) are now handled within the existing processes. The design of it from the
python side is that of an extensible event handler; as such it's extremely easy to add new commands in, or special
case certain things.
Repository Enhancements
Pkgcore internally uses a sane/uniform repository abstraction- the benefits of this are:
• repository class (which implements the accessing of the on disk/remote tree) is pluggable. Remote vdb/portdir
is doable, as is having your repository tree run strictly from downloaded metadata (for example), or running
from a tree stored in a tarball/zip file (mildly crazy, but it's doable).
• separated repository instances. We've not thrown out overlays (as paludis did), but pkgcore doesn't force every
new repository to be an overlay of the 'master' PORTDIR as portage does.
• optimized repository classes- for the usual vdb and ebuild repository (those being on-disk backwards compatible
with portage 2.x), the number of syscalls required was drastically reduced, with on-disk info (what packages are
available per category, for example) cached. It is a space vs time trade off, but the space cost is negligible
(a couple of dicts with, worst case, 66k mappings)- as is, portage's listdir caching consumed a bit more memory
and was slower, so all in all a gain (mainly it's faster while using slightly less memory than portage's caching).
• unique package instances yielded from repository. Pkgcore uses a package abstraction internally for accessing
metadata/version/category, etc- all instances returned from repositories are unique immutable instances. Gain
of it is that if you’ve got dev-util/diffball-0.7.1 sitting in memory already, it will return that instance instead of
generating a new one- and since metadata is accessed via the instance, you get at most one load from the cache
backend per instance in memory- cache pull only occurs when required also. As such, far faster for when doing
random package accessing and storing of said packages (think repoman, dependency resolution, etc).
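The unique-instance behaviour can be sketched with a weak-value cache (an assumed mechanism for illustration, with invented class names; the heapy session above does show weakref.KeyedRef entries consistent with this kind of cache):

```python
# Sketch: a repository-level weak cache so repeated lookups of the same
# cpv return the same package object, while unused packages can still be
# garbage collected (weak values never pin memory).
import weakref

class Package(object):
    def __init__(self, cpvstr):
        self.cpvstr = cpvstr

class Repo(object):
    def __init__(self):
        self._cache = weakref.WeakValueDictionary()

    def get(self, cpvstr):
        pkg = self._cache.get(cpvstr)
        if pkg is None:
            # miss: build the instance once and remember it weakly
            pkg = self._cache[cpvstr] = Package(cpvstr)
        return pkg

repo = Repo()
a = repo.get('dev-util/diffball-0.7.1')
b = repo.get('dev-util/diffball-0.7.1')
print(a is b)  # True - one instance, so at most one cache pull for it
```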
3.1.18 Tackling domain
tag an 'x' in front of stuff that's been implemented
unhandled (eg, figure these out) vars/features
• (user)?sandbox
• userpriv(_fakeroot)?
• digest
• cvs (this option is a hack)
• fixpackages, which probably should be a sync thing (would need to bind the vdb and binpkg repo to it though)
• keep(temp|work), easy to implement, but where to define it?
• PORT_LOGDIR
• env overrides of use...
vdb wrapper/vdb repo instantiation (either domain created wrapper, or required in the vdb repo section def)
• CONFIG_PROTECT*
• collision-protect
• no(doc|man|info|clean) (wrapper/mangler)
• suidctl
• nostrip. In effect, strip defaults to on; wrappers are needed if it is occasionally on, occasionally off.
• sfperms
build section (vars)
• C(HOST|TARGET), (LD*|C*)FLAGS?
• (RESUME|FETCH)COMMAND are fetcher things, define it there.
• MAKEOPTS
• PORTAGE_NICENESS (imo)
• TMPDIR ? or domain it?
gpg is bound to repo, class type specifically. strict/severe are likely settings of it. the same applies for profiles.
distlocks is a fetcher thing, specifically (probably) class type.
buildpkgs is binpkg + filters.
package.provided is used to generate a separate vdb, a null vdb that returns those packages as installed.
3.1.19 Testing
We use twisted.trial for our tests, to run the test framework run:
trial pkgcore
Your own tests must be stored in pkgcore.test - furthermore, tests must pass when run repeatedly (-u option). You will
want at least twisted-2.2 for that; <2.2 has a few false positives.
Testing for negative assertions
When coding it’s easy to write test cases asserting that you get result xyz from foo, usually asserting the correct flow.
This is ok if nothing goes wrong, but that doesn’t normally happen. :)
Negative assertions (there probably is a better term for it) means asserting failure conditions and ensuring that the code
handles zyx properly when it gets thrown at it. Most test cases seem to miss this, resulting in bugs being able to hide
away for when things go wrong.
Using --coverage
When writing tests for your code (or for existing code without any tests), it is very useful to use --coverage. Run
trial --coverage <path/to/test>, and then check <cwd>/_trial_temp/coverage/<test/module/name>. Any lines prefixed
with '>>>>>' have not been covered by your tests. This should be rectified before your code is merged to mainline
(though this is not always possible). Those lines prefixed with a number show the number of times that line of code
was evaluated.
3.1.20 perl CPAN
• makeCPANstub in Gentoo/CPAN.pm , dumps cpan config
• screen scraping to get deps, example page http://kobesearch.cpan.org/, use getCPANInfo from CPAN
• use FindDeps for this
• use unmemoize(func) to back out the memoizing of a func; do this on FindDeps
3.1.21 dpkg
this is just basic notes, nothing more. If you know details, fill in the gaps kindly
repos are combined.
Sources.gz (list of source-based debs) holds name, version, and build deps.
Packages.gz (binary debs, dpkgs) holds name, version, size, short and long description, and bindeps.
repository layout:
dists/
    stable/
        main/
            arch/       # binary-arm fex
            source/     # ?
        contrib/        # ?
            arch/       # binary-arm fex
            source/
        non-free/       # guess.
            arch/
            source/
    testing/ ...
    unstable/ ...
arch/binary-* dirs hold Packages.gz and (potentially) Release; source dirs hold Sources.gz and (optionally) Release.
dpkg has preinst, postinst, prerm, and postrm, with the same semantics as ebuilds in terms of when they run (coincidence? :)
in dpkg          in ebuild
Build-Depends    our DEPEND
Depends          our RDEPEND
Pre-Depends      configure-time DEPEND
Conflicts        blockers, affected by Essential (read up on this in the debian policy guide)
3.1.22 WARNING
This is the original brain dump from harring; it is not guaranteed to be accurate to the current design. It's kept around to give an idea of where things came from, to contrast with what is in place now.
3.1.23 Introduction
e'yo. General description of layout/goals/info/etc, and a semi-sorta api.
That, and an aggregator of random-ass crazy quotes should people get bored.
DISCLAIMER
This ain't the code.
In other words, the actual design/code may be radically different, and this document will probably trail any major overhauls of the design/code (speaking from past experience).
Updates are welcome, as are suggestions and questions. Please dig through all the documentation in the dir this doc sits in first, however, since there is a lot of info (both current and historical) related to it. Collapsing info into this doc is attempted, but an explanation of the full restriction protocol (fex) is a lot of info, and the original idea comes from previous redesigns. Short version: historical, but still relevant, info on restrictions is in layout.txt. Other subsystems/design choices quite likely have their basis in other docs in this directory, so do your homework please :)
Terminology
cp          category/package
cpv         category/package-version
ROOT        livefs merge point, fex /home/bharring/embedded/arm-target or, more commonly, root=/
vdb         /var/db/pkg, the installed packages database.
domain      combination of repositories, root, and build information (use flags, cflags, etc); effectively config data + repositories.
repository  trees: the ebuild tree (/usr/portage), binpkg tree, vdb tree, etc.
protocol    python name for a design/api. iter(), fex, is a protocol: iter(o) calls i = o.__iter__(), and the returned object is expected to yield an element each time i.next() is called, till it runs out of elements (then raising StopIteration). Hesitant to call it a defined hook on a class/instance, but this (crappy) description should suffice.
seq         sequence; lists/tuples
set         a list without order (think dict.keys())
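The iter() protocol entry above can be spelled out with a tiny sketch. Note that modern Python spells the next hook __next__; the i.next() in the text is the Python 2 form. The class name here is purely illustrative:

```python
class Countdown:
    """A minimal object honoring the iter() protocol described above."""

    def __init__(self, start):
        self.cur = start

    def __iter__(self):
        # iter(o) calls this; this object is its own iterator
        return self

    def __next__(self):
        # yield elements until exhausted, then raise StopIteration
        if self.cur <= 0:
            raise StopIteration
        self.cur -= 1
        return self.cur + 1
```

Anything following this shape can be consumed by a for loop or list(); for example, list(Countdown(3)) produces [3, 2, 1].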
General design/idea/approach/requirements
All pythonic components installed by pkgcore must be within the pkgcore.* namespace. No more polluting python's namespace, plain and simple. Third party plugins to pkgcore aren't bound by this, however (their mess, not ours).
The API flows from the config definitions; everything internal is effectively the same. Basically, the config data gives you your starter objects, and from there you dig deeper into the innards as needed, action-wise.
The general design is intended to heavily abuse OOP. Further, delegation of actions down to components must be abided by, an example being the repo + cache interaction: the repo does what it can, but for searching the cache, let the cache do it. Assume that what you're delegating to knows the best way to handle the request, and can probably do its job better than some external caller (essentially).
3.1. Content
45
pkgcore Documentation, Release trunk
Actual configuration is pretty heavily redesigned. Classes and functions that should be constructed based on data from
the user’s configuration have a “hint” describing their arguments. The global config class uses these hints to convert
and typecheck the values in the user’s configuration. Actual configuration file reading and type conversion is done by
a separate class, meaning the global manager is not tied to a single format, or even to configuration read from a file on
disk.
Encapsulation, extensibility/modularity, delegation, and allowing parallelized development should be key focuses in implementing/refining this high level design doc. Realize parallelizing is a funky statement, but it's apt; work on the repo implementations can proceed without being held up by cache work, and vice versa.
Final comment re: design goals, defining chunks of callable code and plugging them into the framework is another bit of a goal. Think twisted, just not quite as prevalent (their needs/focus are much different from ours: twisted is the app and your code is the lib; vice versa for pkgcore).
Back to config. Here's the general notion of the config 'chunks' of the subsystem (these map to run-time objects unless otherwise stated):

domain
    +-- profile (optional)
    +-- fetcher (default)
    +-- repositories
    +-- resolver (default)
    +-- build env data? (never actually instantiated; no object)
    \-- livefs_repo (merge target, non-optional)

repository
    +-- cache (optional)
    +-- fetcher (optional)
    +-- sync (optional, may change)
    \-- sync cache (optional, may change)

profile
    +-- build env?
    +-- sets (system mainly)
    \-- visibility wrappers
domain is configuration data: accept_(license|keywords), use, cflags, chost, features, etc. profile, dependent on the profile class you choose, is either bound to a repository, or to a user-defined location on disk (/etc/portage/profile fex). The domain knows to do the incremental crap upon profile settings, lifting package.* crap for visibility wrappers for repositories also.
repositories is pretty straightforward: portdir, binpkg, vdb, etc.
Back to domain. Domains are your definition of pretty much what can be done. Can't do jack without a domain, period. You can have multiple domains too, and domains do not have to be local (remote domains being a different class type). Clarifying: think of 500 desktop boxes, and a master box that's responsible for managing them. Define an appropriate domain class and appropriate repository classes, have a config that holds the 500 domains (one representing each box), and you can push updates out via standard api trickery. In other words, the magic is hidden away; just define remote classes that match the defined class rules (preferably inheriting from the base class, since isinstance sanity checks will become the norm), and you could do emerge --domain some-remote-domain -u glsa on the master box. Emerge won't know it's doing remote crap. Pkgcore won't either. It'll just load what you define in the config.
Ambitious? Yeah, a bit. Thing to note: the remote class additions will most likely exist outside of pkgcore proper. Develop the code needed in parallel to fleshing out pkgcore proper.
Meanwhile, the remote bit + multiple domains + class overrides in config definition is _explicitly_ for the reasons
above. That and x-compile/embedded target building, which is a bit funkier.
Currently, portage has DEPEND and RDEPEND. How do you know what needs be native to that box to build the
package, and what must be chost atoms? Literally, how do you know which atoms, say the toolchain, must be native, vs which packages' headers/libs must merely exist to build it? We need an additional metadata key, BDEPEND (build depends). If you have BDEPEND, you know what is actually run locally in building a package, vs what headers/libs are required. A subtle difference, but BDEPEND would allow (with a sophisticated depresolver) the toolchain to be represented in deps, rather than the current unstated-dep approach profiles allow.
Aside from that, BDEPEND could be used for x-compile via inter-domain deps; a ppc target domain on an x86 box would require BDEPEND from the default domain (x86). So... that's useful.
So far no one has shot this down; moreso, no one has come up with reasons why it wouldn't work. The consensus thus far is mainly "err, don't want to add the deps, too much work". Regarding the work, use indirection.
virtual/toolchain-c    metapkg (glep37) that expands out (dependent on arch) into whatever is required to build c sources
virtual/toolchain-c++  same thing, just c++
virtual/autotools      take a guess.
virtual/libc           this should be tagged into rdepends where applicable, for packages that directly require it (compiled crap mainly)
Yes, it's extra work, but the metapkgs above should cover a large chunk of the tree, say >90%.
Config design
Portage thus far (<=2.0.51*) has had a variable ROOT (livefs merge point), but no way to vary configuration data aside from a buttload of env vars. Further, there has been only one repository allowed (overlays are just that, extensions of the 'master' repository). Adding support for any new format is mildly insane due to hardcoding up the wing wang in the code, and extension/modification of the existing format (ebuild) has some issues (namely the doebuild block of code).
The goal is to address all of this crap. Format agnosticism at the repository level comes via an abstracted repository design that should supply generic inspection attributes to match other formats. Specialized searching is possible via match, extending the extensibility of the prototype repository design.
Format agnosticism for building/merging is somewhat reliant on the repo, namely package abstraction and the abstraction of building/merging operations.
On-disk configuration for alternative formats is extensible via changing section types and plugging them into the domain definition.
Note that alt. formats quite likely will never be implemented in pkgcore proper; that's kind of the domain of pkgcore addons. In other words, dpkg/rpm/whatever quite likely won't be worked on by pkgcore developers, at least not in the near future (too many other things to do).
The intention is to generalize the framework so it's possible for others to do so if they choose, however.
Why is this good? The ebuild format has issues, as does our profile implementation. At some point, alternative formats/non-backwards-compatible tweaks to the formats (ebuild or profile) will occur, and then people will be quite happy that the framework is generalized (seriously, nothing is lost from a properly abstracted design, and flexibility/power is gained).
config’s actions/operation
pkgcore.config.load_config() is the entry point; it returns a config object (pkgcore.config.central). This object gives you access to the user-defined configs, although your only interest in poking at it should be to get a domain object from it.
The domain object is instantiated by the config object via user-defined configuration. Domains hold instantiated repositories, bind the profile + user prefs (use/accept_keywords) together, and _should_ simplify this data into somewhat user-friendly methods. (define this better).
The normal/default domain doesn't know about other domains, nor gives a damn. Embedded targets are domains, and _will_ need to know about the livefs domain (root=/), so buildplan creation/handling may need to be bound into domains.
Objects/subsystems/stuff
So... this is a general naming of pretty much the top level view of things; stuff emerge would be interested in (and would fool with). Hesitant to call it a general api, but it probably will end up as such, exempting any abstraction layer/api over all of this (good luck on that one }:] ).
IndexableSequence
Functions as a set and a dict, with caching and on-the-fly querying of info. Mentioned due to its use in repository and other places... (it's a useful lil sucker).
This actually is misnamed: the order of iteration isn't necessarily reproducible, although it's usually constant. IOW, it's normally a sequence, but the class doesn't implicitly force it.
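A minimal sketch of the idea (illustrative only, not pkgcore's real implementation; the constructor callables are assumptions): iterating gives the top-level keys, while indexing queries the values under a key on the fly and caches them.

```python
class IndexableSequence:
    """Sketch: acts as a set (iterate it) and a dict (index it),
    pulling and caching values on the fly."""

    def __init__(self, get_keys, get_values):
        self._get_keys = get_keys      # callable yielding top-level keys
        self._get_values = get_values  # callable: key -> iterable of values
        self._cache = {}

    def __iter__(self):
        return iter(self._get_keys())

    def __contains__(self, key):
        return key in set(self._get_keys())

    def __getitem__(self, key):
        # query on first access, cache thereafter
        if key not in self._cache:
            self._cache[key] = tuple(self._get_values(key))
        return self._cache[key]
```

Usage would look like a repository's .categories: iterate for all categories, index for the sub-categories of one of them.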
LazyValDict
Similar to ixseq: late loading of keys, and on-the-fly pulling of values as requested.
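Sketched out (again illustrative, not the exact class): the key list is loaded once on first use, and each value is fetched only when asked for, then cached.

```python
class LazyValDict:
    """Sketch: keys loaded lazily, values pulled and cached on demand."""

    def __init__(self, get_keys, get_val):
        self._get_keys = get_keys  # callable returning all keys
        self._get_val = get_val    # callable: key -> value
        self._keys = None
        self._vals = {}

    def _ensure_keys(self):
        if self._keys is None:
            self._keys = frozenset(self._get_keys())

    def __getitem__(self, key):
        self._ensure_keys()
        if key not in self._keys:
            raise KeyError(key)
        if key not in self._vals:
            # fetch the value exactly once, on first request
            self._vals[key] = self._get_val(key)
        return self._vals[key]

    def keys(self):
        self._ensure_keys()
        return list(self._keys)
```

This is the shape used for things like metadata dicts, where computing every value up front would defeat the point of lazy loading.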
global config object (from pkgcore.config.load_config())
see config.rst.
domain object
Bit of debate on this one I expect. Any package.{mask,unmask,keywords} mangling is instantiated as a wrapper around repository instances upon domain instantiation. The code should be smart and lift any package.{mask,unmask,keywords} wrappers from repository instances and collapse them, pointing at the raw repo (basically, don't have N wrappers; collapse them into a single wrapper). Not worth implementing until the wrapper is a faster implementation than the current pkgcore.repository.visibility hack, though (currently O(N) for each pkg instance, N being visibility restrictions/atoms). Once it's O(1), collapsing makes a bit more sense (it can be done in parallel, however).
A word on inter-repository dependencies... simply put, if the repository only allows satisfying deps from the same repository, the package instance's *DEPEND atom conversions should include that restriction. The same trickery keeps ebuilds from depping on rpm/dpkg (and vice versa).
.repositories In the air somewhat on this one: either an indexablesequence or a repositorySet. The nice aspect of the latter is that you can just use .match with appropriate restrictions. A very simple interface imo, although it should provide a way to pull individual repositories/labels of said repos from the set. Basically, mangle a .raw_repo indexablesequence type trick (hackish, but nail it down when we reach that bridge).
build plan creation
<TODO insert details as they’re fleshed out>
sets
TODO: chuck in some details here. Probably defined via user config and/or profile, although what does it define? atoms/restrictions? itermatch might be useful for a true set.
build/setup operation
(need a good name for this; dpkg/rpm/binpkg/ebuild’s ‘prepping’ for livefs merge should all fall under this, with
varying use of the hooks)
.build() do everything, calling all steps as needed
.setup() whatever tmp dirs required, create ‘em.
.req_files() fetchables, although not necessarily with a url (restrict="fetch"...)
.unpack() guess.
.configure() unused till ebuild format version two (ya know, that overhaul we’ve been kicking around? :)
.compile() guess.
.test() guess.
.install() install to tmp location. may not be used dependent on the format.
.finalize() good to go. generates (jit?) contents/metadata attributes, or returns a finalized instance; should produce an immutable package instance.
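The hook ordering above can be sketched as a base class whose .build() drives the stages in order (a sketch only; stage bodies are format-specific, and the real operation also has to deal with fetching req_files):

```python
class BuildOperation:
    """Sketch: .build() calls each stage in order; a format overrides
    only the stages it actually needs."""

    stages = ("setup", "unpack", "configure", "compile", "test",
              "install", "finalize")

    def build(self):
        # do everything, calling all steps as needed
        for stage in self.stages:
            getattr(self, stage)()

    def setup(self):
        pass  # create whatever tmp dirs are required

    def unpack(self):
        pass

    def configure(self):
        pass  # unused till ebuild format version two

    def compile(self):
        pass

    def test(self):
        pass

    def install(self):
        pass  # install to a tmp location; may be a no-op for some formats

    def finalize(self):
        pass  # hand back an immutable package instance
```

A binpkg format, for instance, would leave compile/test as no-ops and only fill in unpack/install/finalize.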
repo change operation
base class.
.package package instance of what the action is centering around.
.start() notify repo we’re starting (locking mainly, although prerm/preinst hook also)
.finish() notify repo we’re done.
.run() high level; calls whatever funcs are needed. The individual methods are mainly for the ui; this is for when you don't want to display "doing install now... done... doing remove now... done" stuff.
remove operation
derivative of repo change operation.
.remove() guess.
.package package instance of what’s being yanked.
install operation
derivative of repo change operation
.package what’s being installed.
.install() install it baby.
merge operation
derivative of repo remove and install (so it has .remove and .install, which must be called in .install and .remove order)
.replacing package instance of what’s being replaced.
.package what’s being installed
fetchables
basically a dict of stuff jammed together, just via attribute access (think c struct equiv)
.filename
.url tuple/list of url’s.
.chksums dict of chksum:val
fetcher
hey hey. take a guess.
Worth noting: if the fetchable lacks .chksums["size"], it'll wipe any existing file. If size exists and the existing file is bigger, wipe the file and start anew; otherwise resume. Mirror expansion occurs here also.
.fetch(fetchable, verifier=None) # if verifier handed in, does verification.
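The resume rules read as follows in code (a sketch; plan_fetch is a made-up name, not the fetcher's real interface):

```python
import os

def plan_fetch(fetchable, path):
    """Return "fresh" or "resume" for a download landing at `path`,
    per the rules above: no known size -> wipe any existing file;
    existing file bigger than the expected size -> wipe and start
    anew; otherwise resume."""
    if not os.path.exists(path):
        return "fresh"
    size = fetchable.chksums.get("size")
    if size is None or os.path.getsize(path) > size:
        os.unlink(path)  # partial file is untrustworthy, start over
        return "fresh"
    return "resume"
```
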
verifier
note this is basically lifted conceptually from mirror_dist. if wondering about the need/use of it, look at that source.
.verify() handed a fetchable, returns either False or True
repository
This should be format agnostic, and hide any remote bits. This is general info for using it, not for designing a repository class.
.mergable() true/false. pass a pkg to it, and it reports whether it can merge that or not.
.livefs boolean, indicating whether or not it's a livefs target. This is useful for the resolver; shop the pkg to other repos, a binpkg fex, prior to shopping it to the vdb for merging to the fs. Or merge to the livefs, then binpkg it while continuing further building dependent on that package (the ui app's choice, really).
.raw_repo either weakrefs self, or (non-weakref) refs another repo. Why is this useful? visibility wrappers... this gives ya a way to see if p.mask is blocking usable packages fex. Useful for the UI, not so much for pkgcore innards.
.frozen boolean. basically, does it account for things changing without its knowledge, or not. frozen=True is faster for ebuild trees, for example: a single check for cache staleness. frozen=False is slower, and is what portage does now (meaning every lookup of a package, and instantiation of a package instance, requires mtime checks for staleness).
.categories IndexableSequence; if iterated over, gives ya all categories; on getitem lookup, sub-category lookups. think media/video/mplayer
.packages IndexableSequence; if iterated over, all package names; if getitem (with a category as key), the packages of that category.
.versions IndexableSequence; if iterated over, all cpvs; if getitem (with cat/pkg as key), the versions for that cp.
.itermatch() iterable; given an atom/restriction, yields matching package instances.
.match() def match(self, atom): return list(self.itermatch(atom)); voila.
.__iter__() in other words, the repository is iterable; yields package instances.
.sync() sync, if the repo swings that way. flesh this out a bit, possibly handing in/back a ui object for getting updates...
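The relationship between iteration, itermatch, and match can be shown with a toy repo (illustrative classes only; the real prototype repository carries far more machinery):

```python
class Repository:
    """Toy repo: iterable, with match() built on itermatch()."""

    def __init__(self, pkgs):
        self._pkgs = list(pkgs)

    def __iter__(self):
        # yields package instances
        return iter(self._pkgs)

    def itermatch(self, restriction):
        # lazily yield matching package instances
        for pkg in self._pkgs:
            if restriction.match(pkg):
                yield pkg

    def match(self, restriction):
        # exactly the doc's definition: list(self.itermatch(atom))
        return list(self.itermatch(restriction))
```

Since match is just list(itermatch(...)), anything that can consume a lazy stream (a resolver, fex) should prefer itermatch.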
digressing for a moment...
Note you can group repositories together; think portdir + portdir_overlay1 + portdir_overlay2. Creating a repositoryset basically involves passing in multiple instantiated repos; depending on that class's semantics, it internally handles the stacking (the right-most positional repo arg overrides the 2nd right-most, ... overriding the left-most). So... stating it again/clearly if it ain't obvious: everything is configuration/instantiation of objects, chucked around/mangled by the pkgcore framework.
What isn't obvious is that since a repositoryset gets handed instantiated repositories, each repo, including the set instance, should be able to have its own cache (this is assuming it's ebuild repos through and through). Why? Cache data doesn't change for the most part, excepting which repo a cpv is from, and the eclass stacking. Handled individually, a cache bound to portdir should be valid for portdir alone; it shouldn't carry data that is a result of eclass stacking from another overlay + that portdir. That's the business of the repositoryset. A consequence of this is that the repositoryset needs to basically reach down into the repository it's wrapping, get the pkg data, then re-request the keys from that ebuild with a different eclass stack. This would be a bit expensive, although once inherit is converted to a pythonic implementation (basically handing the path of the requested eclass down the pipes to ebuild*.sh), it should be possible to trigger a fork in the inherit, and note python-side that multiple sets of metadata are going to be coming down the pipe. That should alleviate the cost a bit, but it also makes multiple levels of cache reflecting each repository instance a bit nastier to pull off till it's implemented.
So... short version: harring is a perfectionist, and says it should be this way; the reality of the situation makes it a bit trickier. Anyone interested in attempting the mod, feel free; otherwise harring will take a crack at it, since he's being anal about having it work in such a fashion.
Or... could do thus: repo + cache as a layer, wrapped with a 'regen' layer that handles cache regeneration as required. Via that, the repositoryset would have a way to override and use its own specialized class that ensures each repo gets what's proper for its layer. Think raw_repo type trick.
continuing on...
cache
Ebuild centric, although who knows (a binpkg cache ain't insane, ya know). Short version: it's functionally a dict with sequence properties (iterating over all keys).
.keys() return every cpv/package in the db.
.readonly boolean. Is it modifiable?
.match() flesh this out. Either handed a metadata restriction (or a set of 'em), or handed a dict with equiv info (like the former). ebuild caches most likely should return mtime information alongside, although maybe dependent on readonly. The purpose of this? It gives you a way to hand off metadata searching to the cache db, rather than the repo having to resort to pulling each cpv from the cache and doing the check itself. This is what will make rdbms cache backends finally stop sucking and seriously rock, properly implemented at least. :) Clarification: you don't call this directly; repo.match delegates to this for metadata-only restrictions.
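That delegation might look like this (hypothetical classes; the real dispatch would have to inspect the restriction itself rather than check a flag):

```python
class CacheBackedRepo:
    """Sketch: repo.match hands metadata-only restrictions to the
    cache, which knows the fastest way to search its own data."""

    def __init__(self, cache, pkgs):
        self.cache = cache
        self._pkgs = list(pkgs)

    def match(self, restriction):
        if getattr(restriction, "metadata_only", False):
            # delegate: an rdbms cache can turn this into one SELECT
            return self.cache.match(restriction)
        # fall back to pulling each pkg and checking it ourselves
        return [p for p in self._pkgs if restriction.match(p)]
```
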
package
This is a wrapped, constant package: a configured ebuild src, binpkg, vdb pkg, etc. ebuild repositories don't exactly return this; they return unconfigured pkgs, which I'm not going to go into right now (domains only see this protocol; visibility wrappers see different).
.depends usual meaning. ctarget depends.
.rdepends usual meaning. ctarget run-time depends. seq.
.bdepends see the ml discussion. chost depends, what's executed in building this (toolchain fex). seq.
.files get a better name for this. doesn't encompass files/*, but could be slipped in that way for remote. encompasses restrict fetch (files with urls), and chksum data. seq.
.description usual meaning, although remember we probably need a way to merge metadata.xml's long desc into the more mundane description key.
.license usual meaning. depset.
.homepage usual. Needed?
.setup() name sucks. gets ya the setup operation, which does the building/whatever.
.data raw data. may not exist; don't screw with it unless you know what it is, and know the instance's .data layout.
.build() if this package is buildable, return a build operation, else return None.
restriction
See layout.txt for more fleshed-out examples of the idea. Note: match and pmatch have been reversed name-wise.
.match() handed a package instance, returns a bool of whether or not this restriction matches.
.cmatch() try to force the changes; this is dependent on the package being configurable.
.itermatch() new one, debatable. short version: given a sequence of package instances, yields true/false for each. why might this be desirable? if the setup for matching is expensive, this gives you a way to amortize the cost. might have use for the glsa set target: define a restriction that limits to installed pkgs, yay/nay if an update is avail...
restrictionSet
Mentioning it merely 'cause it's a grouping (boolean and/or) of individual restrictions. An atom, which is in reality a category restriction, package restriction, and/or version restriction, is a boolean-and set of restrictions.
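A sketch of both ideas together: a base restriction with match()/itermatch(), plus the boolean-and grouping an atom collapses to (class names here are illustrative, not pkgcore's real ones):

```python
class Restriction:
    def match(self, pkg):
        raise NotImplementedError

    def itermatch(self, pkgs):
        # default: no setup worth amortizing; just map match over the seq
        for pkg in pkgs:
            yield self.match(pkg)


class AttrEquals(Restriction):
    """Matches a single package attribute, e.g. category or package."""

    def __init__(self, attr, value):
        self.attr, self.value = attr, value

    def match(self, pkg):
        return getattr(pkg, self.attr, None) == self.value


class AndRestriction(Restriction):
    """Boolean-and set of restrictions: an atom is roughly
    and(category, package, version)."""

    def __init__(self, *children):
        self.children = children

    def match(self, pkg):
        return all(child.match(pkg) for child in self.children)
```
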
ContentsRestriction
What's this, you say? A restriction for searching the vdb's contents db? Perish the thought! ;)
metadataRestriction
Mentioning this for the sake of pointing out a subclass of it, DescriptionRestriction- this will be a class representing
matching against description data. See repo.match and cache.match above. The short version is that it encapsulates
the description search (a very slow search right now) so that repo.match can hand off to the cache (delegation), and the
cache can do the search itself, however it sees fit.
So... for the default cache, flat_list (19500 ebuilds == 19500 files to read for a full searchDesc), it's still slow unless flat_list gets some desc. cache added to it internally. If it's a sql-based cache, the sql_template should translate the query into the appropriate select statement, which should make it much faster.
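For instance, a sql-backed cache could answer a description search with a single query instead of a per-ebuild scan. A toy sketch with sqlite (the schema and function name are invented for illustration):

```python
import sqlite3

def search_descriptions(conn, substring):
    """Translate a description match into one SELECT over the cache."""
    cur = conn.execute(
        "SELECT cpv FROM metadata WHERE description LIKE ?",
        ("%" + substring + "%",))
    return [row[0] for row in cur]

# toy in-memory cache db standing in for a real sql cache backend
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE metadata (cpv TEXT, description TEXT)")
conn.executemany(
    "INSERT INTO metadata VALUES (?, ?)",
    [("media-video/mplayer-1.0", "media player supporting many formats"),
     ("dev-lang/python-2.4", "interpreted programming language")])
```

The point is the delegation: the repo hands the restriction down, and the cache picks the query plan, which is exactly what the file-per-cpv caches cannot do.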
Restating that: delegation is absolutely required. There have been requests to add intermediate caches to the tree, or to move data around (whether collapsing metadata.xml or moving data out of ebuilds) so that the form it is stored in is quicker to search. These approaches are wrong. It should be clear from the above that a repository can, and likely will, be remote on some boxes. Such a shift of metadata does nothing but make repository implementations that much harder, and shifts power away from what knows best how to use it. Delegation is a massively more powerful approach, allowing for more extensibility, flexibility, and speed.
Final restating: searchDesc is matching against cache data. The cache (whether flat_list, anydbm, sqlite, or a remote sql-based cache) is the authority on the fastest way to search its data. Programmers get pissed off when users try to tell them how something should be implemented internally; this is fundamentally the same scenario. The cache class the user chooses knows how to do its job best; provide methods for handing control down to it, and let it do its job (delegation). Otherwise you've got a backseat-driver situation, which doesn't let those in the know do the deciding (the cache knows, the repo doesn't).
Mind you, not trying to be harsh here. If in reading through the full doc you disagree, question it; and if speeding up the current cache implementation, note that any such change must be backwards compatible, and must not screw up the possibilities of encapsulation/delegation this design aims for.
logging
Flesh this out (define this, basically). Short version: no more writemsg type trickery; use a proper logging framework.
ebuild-daemon.sh
Hardcoded paths have to go. /usr/lib/portage/bin == kill it. Upon initial loadup of ebuild.sh, dump the default/base path down to the daemon, including a setting for /usr/lib/portage/bin. Likely declare -xr it, then load the actual ebuild*.sh libs. Backwards compatibility for that is thus: ebuild.sh defines the var itself in global scope if it's undefined. A semblance of backwards compatibility (which is actually somewhat pointless, since I'm about to blow it out of the water).
Ebuild-daemon.sh needs a function for dumping a _large_ amount of data into bash, more than just a line or two. For the ultra paranoid: we load up eclasses, ebuilds, and profile.bashrc's on the python side, pipe that to gpg for verification, then pipe the data straight into bash. No race condition is possible for files used/transferred in this manner.
A thought: the screw-around speed-up hack preload_eclasses, added in ebd's heyday of making it as fast as possible, would be one route. Basically, after verification of an elib/eclass, preload the eclass into a func in the bash env, and declare -r the func after the fork. This protects the func from being screwed with, and gives a way to cache the verified bash code in memory (at least per ebd instance).
It could work, surprisingly enough (the preload_eclass command already works), and would probably be fairly fast versus the alternative. So... the race condition can probably be flat-out killed off without massive issues. Still leaves a race for perms on any files/*, but neh. A) that stuff shouldn't be executed, B) security is good, but we can't cover every possibility (we can try, but diminishing returns).
A lesser, but still tough, version of this is to use the indirection for actual sourcing to get paths instead. No EBUILD_PATH; query the python side for the path, which returns either '' (which ebd interprets as "err, something is whacked, time to scream") or the actual path.
In terms of timing, gpg verification of ebuilds should probably occur prior to even spawning ebd.sh. profile, eclass, and elib sourcing should use this technique for on-the-fly verification, though. Object interaction for that one is going to be really fun, as will be mapping config settings to the instantiation of objs.
CHAPTER 4
Indices and tables
• genindex
• modindex
• search