Lecture Note 6 for MBG 403 Data Database Management

Lecture Note for MBG 404
Jens Allmer
Lecture Note 6 for MBG 403
Data
Biological experiments generate large amounts of data. They may concern plant morphology
or may represent a collection of measurements such as leaf sizes or they may enclose a large
number of miRNA sequences. Many data items are worth storing such as gel images or PCR
settings. The need to save this information is evident and not critical with current computer
equipment. The information about relationships between data may, however, be lost in the
process.
Inter Data Relationships
Proteins from a crude cell extract separated by two dimensional gel electrophoresis with
subsequent measurement of the proteins in the spots by LC-MS/MS generate a huge amount
of data. The data is however not independent. It is, for example, essential to know where on
the gel the protein was detected in order to infer any biological knowledge. Storing all
information in a conventional file system, even if ordered and sorted using good naming
strategies and directory hierarchies, may lead to problems. Clearly, relating the data later
leads to the necessity to sift through a large number of files in order to relate information.
Database Management Systems
For solving the problem of modeling relationships among data, database management systems
(DBMS) come into play. In essence, a DBMS provides efficient storage, editing, and querying
facilities for data in a computer file system for many users simultaneously. Many additional
features to make these operations secure and efficient are part of a DBMS. The set of features,
efficiency, and security among many others are the reason why many DBMSs are available.
For this course Microsoft Access and OpenOffice Base which are targeting beginners and
occasional users of DBMSs will be used. Even though rather simple they can also be used in
multi user environments. The advantage in this course is the ease of installation and the
graphical querying interface which allows for querying without knowing a lot of structured
query language (SQL,which will be the topic of the next chapter). Other, more powerful
DBMSs are for example Oracle, MS SQLServer, MySQL, IBM DB2, and PostgreSQL to
name just a few.
Relation
The DBMS that will be used in this course are all relational. Relation is a term from
mathematics meaning a matrix or simply a table. It consists of rows and columns. While
columns are named, rows are not. Each column is associated with a set of possible or allowed
values. This is called the domain of the attribute. A domain can be similar to a type in other
programming languages for example integer or double could be domains allowing only that
type of number to be entered into that column.
A relation must have a name; the name in Figure 1 is written underneath the table but usually
it is written on top. Each attribute must have a unique name within the attributes of a relation.
A relation never contains data what is seen in Figure 1 is an instance of a relation. An instance
actually contains data, here two beers, whereas a relation only describes and limits the
possible content of a table. The data is referred to as rows or tuples. They are usually arrays of
data often from various, mixed domains. Each row in a table should be uniquely addressable.
1
Lecture Note for MBG 404
Jens Allmer
In Figure 1 this is possible using either name or manufacturer (manf) but when the instance
grows it may not be as simple anymore. A manufacturer may produce many beers and
different manufacturers could possibly give the same name to one of their beers.
Figure 1: A Relation with naming for most parts of it. The name of the relation is ‘Beers’.
Therefore, both name and manf are needed to unambiguously address tuple in the instance of
beers.
Table Space/ Database
One relation is similar to a spread sheet, not much is gained by storing one relation in a
DBMS. The power of a relational database is that it can model relationships among data
(relationship ≠ relation). This is similar to having several spreadsheets whose values have a
common purpose and combining them by their relationship.
A DBMS allows the creation of many databases each of which allows the storage of many
relations. Thus data can be highly organized within a DBMS.
2
Lecture Note for MBG 404
Jens Allmer
Looking at the two relations above, they may not have much in common. However in a
relational database the relationship that connects Sequences to Locations may reside in a third
table that models the many to many relationship.
Using the relation SeqLocs the information in Sequences and Locations can be related to each
other. This is of course a very simple example and even small database projects may consist
of many Tables.
Figure 2: Simplified table space of the 2DB database.
The tables PepProt, ORGDB and Members in Figure 2 serve the same function as SeqLocs in
the above example.
3
Lecture Note for MBG 404
Jens Allmer
Relationship Degrees
It is essential to model the relationships among data in order to avoid problem during the
lifetime of the database. Relationships are manifold; the easiest relationship is one to one.
One to one relationship
An example in Figure one could be the relationship between Identifications and Peptides.
Each identification can only be explained by one peptide. On the other hand, a peptide could
be identified many times which would be a one to many relationship if viewed from Peptides.
However there is no explicit relationship from Peptides to Identifications while there is one
from Identifications to Peptides (Identifications.PeptideID  Peptides.ID). A one to one
relationship is a special case of a one to many relationship.
One to many and many to one relationship
The relationship between Separations and Fractionations in Figure one is one to many. A
separation can contain many fractionations (A gel can have many bands). This is probably the
most common relationship among data with many examples in Figure 1. The opposite of a
many to one relationship is a one to many relationship. It solely depends on the view angle. A
more complex relationship is the many to many relationship.
Many to many relationship
A many to many relationship cannot be modeled using the two related tables, it always needs
a facilitator. This can be seen in Figure 1 (PepProt, ORGDB, ..) and in the example given
earlier. Basically, one information in one table can be related to many rows in another table
while each of these rows can be related to many rows in the initial table. In order to relate the
corresponding pieces of information a middleman is needed (e.g. SeqLocs). Looking from
Sequences towards Locations via SeqLocs it is possible to retrieve all records from Locations
that are related to a particular sequence (e.g. several organisms may have the same sequence).
Looking from the other direction it is possible to retrieve all sequences for a given locus.
Normalization
All of the relationships discussed above are void if only one table exists. If all data exists in
one table a lot of redundant information needs to be stored. Let’s imagine Sequences and
Locations would be one table, then for each Sequence all the information such as
chromosome, the location and the species must be stored. This is a waste of space since using
the relational database model only the relationship needs to be stored (SeqLocs). In most
cases this is more efficient. Furthermore there are many more problems that can be avoided if
data is properly organized into tables. Redundant data is, however, the biggest cause for
problems in a database therefore decomposition of the data into tables, called normalization,
is performed.
First Normal Form
The first normal form simply requires all attributes to have atomic values. This means that if
in a column titled Name in which first and last name are stored the relation is not in first
normal form. Simplistic we could say that if each piece of information has its own attribute,
the table is in first normal form.
4
Lecture Note for MBG 404
Jens Allmer
For more information go to: http://en.wikipedia.org/wiki/First_normal_form.
Second Normal Form
To be in second normal form a table needs to be in first normal form. Furthermore, there must
be a set of attributes (key) which can uniquely identify each tuple in any instance of the
relation. In addition to that at least one attribute must exist which is not part of this set.
For more information go to: http://en.wikipedia.org/wiki/Second_normal_form.
Third Normal Form
To be in third normal form a relation must be in second normal form first. Additionally each
of the attributes not part of a key must directly depend on one of the keys in the relation.
For more information go to: http://en.wikipedia.org/wiki/Third_normal_form.
In addition there are two further normal forms but they are not in widespread use. Bringing
the data that needs to be modeled into third normal form avoids most of the problems that can
occur like update and delete anomalies.
5
Lecture Note for MBG 404
Jens Allmer
Microsoft Access
After creating an MS Access database, this database can be filled with tables. Figure 3 shows
how that can be achieved.
Figure 3: An MS Access database with the table design view open. Three fields were created in Table1.
Accession is the key for the table and Type is a string (text). Sequence must hold more data therefore its
type is being set to Memo.
Repeatedly creating tables in database will create a table space with all related information
necessary for your database. Double clicking one of the tables opens a form which allows
editing of the information stored in the table (Figure 4). Large number of rows can be entered
into Access whereas the limit for Excel is 65536. Clearly, entering that amount of data into
Excel seems tedious and pointless. How can such a large table be review manually?
Relational databases come with a query language (SQL) which makes it possible to ask many
questions and retrieve answers from a database.
6
Lecture Note for MBG 404
Jens Allmer
Figure 4: Table Sequences open for editing of information.
It may be too much work to enter all information manually; therefore some import facilities
are available. Clicking on Files – Get External Data – Import brings up a dialog which allows
import from other databases e.g. Paradox and even text files.
Importing FASTA files is unfortunately not a feature of MS Access.
7