Diversity in software engineering research

Diversity in software
engineering research
Meiyappan Nagappan
Thomas Zimmermann
Christian Bird
Presented by Guilherme de Assis
1 Introduction
- Researchers have worked towards maximizing the impact of
software engineering research on practice.
- They tried to provide techniques and results that are general
as posible.
- “General conclusions from empirical studies in software
engineering are difficult because any process depends on a
potentially large number of relevant context variables”, Basili
et al.
2
1 Introduction
- Selecting projects for a study is an important step of the
study.
- More is not necessary better.
- Your sample must be representative of population.
- The article aims to:
- assess the quality of a sample
- identify projects that could be added to further
improve the quality of the sample.
3
1 Introduction
- Diversity: a diverse sample contains members of every
subgroup in the population and within the sample the
subgroups have roughly equal size.
- Representativeness: in a representative sample the size
of each subgroup in the sample is proportional to the size
of that subgroup in the population.
4
1 Introduction
- Sample Covarage: defined as the percentage of projects in a
population that are similar to a given sample.
- Sample coverage allows us to assess the quality of a given
sample; the higher the coverage, the better
- Further, it allows prioritizing projects that could be added to
further improve the quality of a given sample
5
2.1 Universe, Space, and Configuration
- Universe: a large set of projects, that can vary for diferent
research areas. Example: all open-source projects.
- Dimensions: characterizes each projects. Example: total lines
of code, number of developers.
- Space: the set of dimensions that are relevant for the
generality of a research. Example: total lines of code, number
of developers, main programming language.
6
2.1 Universe, Space, and Configuration
- For each dimension d, a similarity function that decides
whether two projects p1 and p2 are similar with respect to
that dimension was defined:
- The list of the similarity functions for a given space is called
the configuration.
- Similarity functions can vary across research studies. Ex: C
and C++.
7
2.1 Universe, Space, and Configuration
- To identify similar projects within the universe, we require the
projects to be similar to each other in all dimensions.
- If no similarity function is definied:
- For numeric dimensions(ex.: number of developers):
- For categorical dimensions(ex.: main programming
language):
- two projects are similar in a dimension if the values are
identical.
8
2.2 Example: Coverage and Project Selection
- Total: 50 projects.
- In total seven other projects
are similar to A. Thus project A
covers (7+1)/50=16% of the
universe.
- For individual dimensions:
project A covers 13/50=26% for
number of developers and
11/50=22% for lines of code.
9
2.2 Example: Coverage and Project Selection
- Show how a second project
increases the coverage.
- Adding Project B: universe coverage
increase to 18/50=36%. Covarege of
the developer to 60% and lines of code
to 56%.
- Adding Project C: The universe
covarage increases only to 18%.
10
2.2 Example: Coverage and Project Selection
- To provide a good coverage of the universe, one should
select projects that are diverse rather than similar to each
other.
11
2.2 Example: Coverage and Project Selection
12
2.2 Example: Coverage and Project Selection
13
2.3 Computing Coverage
- Sample coverage of a set of projects P for a given
universe U, an n-dimensional space D, and a configuration
(
1, … ,
) is computed as follows:
- Research topics can have different parameters for universe,
space, and configuration.
- It is important to not just report the coverage but also the context
14
in which it was computed.
3.1 The Ohloh Universe
- Universe: active projects that are monitored by the Ohloh
platform.
- Steps:
-
1. Extract the identifiers of active projects using Project API of Ohloh
to measure coverage for ongoing development.
2. Three diferent categories of data (each with one call to the API):
Analysis: main programming language, source code size and
contributors.
Activity category: how much source code developers have
changed each month
Factoid category: basic observations about projects such as
team size, project age, comment ratio, and license conflicts.
15
3.1 The Ohloh Universe
- 3. XML from Ohloh to text files. Remove projects that
don’t have main language or invalid data (negative
number of lines of code).
- The universe consists of a total of 20.028 projects.
16
3.2 The Ohloh Space
The dimensions used for the space are:
-
Main language. The most common programming language in the project.
Total lines of code. Blank lines and comment lines are excluded by Ohloh when
counting lines of code.
Number of contributors (12 months). Contributors with at least one commit in the
last 12 months.
Number of churn (12 months). Number of added and deleted lines of code,
excluding comment lines and blank lines, in the last 12 months.
Number of commits (12 months). Commits made in the last 12 months.
17
3.2 The Ohloh Space
-
Project age:
- < 1 year: Young
- between 1 and 3: Normal
- between 3 and 5: Old
- > 5: Very Old
-
Project activity. if during the last 12 calendar months, there were at least 25% fewer
commits than in the prior 12 months, the activity is Decreasing; if there were 25%
more commits, the activity is Increasing; otherwise the activity is Stable.
18
3.2 The Ohloh Space
19
3.3 Covering the Ohloh Universe
20
3.3 Covering the Ohloh Universe
21
3.4 Covering the Ohloh Universe with the
ICSE and FSE Conferences
- What are the most frequently used projects in the ICSE and
FSE conferences?
-
Eclipse Java Development Tools (JDT) - 16 papers
Apache HTTP Server - 12 papers
gzip, jEdit, Apache Xalan C++, Apache Lucene - 8 papers
Mozilla Firefox - 7 papers
- How much of the Ohloh universe do the ICSE and FSE
conferences cover?
-
The 207 Ohloh projects analyzed in the two years of the ICSE and
FSE conferences covered 9.15% of the Ohloh population.
There isn’t low, because values in all dimensions have to be similar for
a project to be similar to another.
22
3.4 Covering the Ohloh Universe with the
ICSE and FSE Conferences
- What are showcases of research with high coverage?
23
4 Discussion
- Coverage scores do not increase or decrease the importance
of research, but rather enhance our ability to reason about it.
- Results that differ can still have value, especially in a space
that is highly covered.
- Always report the universe, space, and configuration with any
coverage score.
24
5 Conclusion
- More isn’t necessary better.
- Technique to assess how well a research study covers a
population of software projects.
- This helps researchers to make informed decisions about
which projects to select for a study.
- Extends to researchers analyzing closed source projects.
- Describe the universe and space of their projects without
revealing confidential information about the projects or their
metrics and place their results in context.
25