How to use JEMLA?

28 April 2015
JEMLA v 1.0
Nargess S. Bathaeian
[email protected] [email protected]
http://profs.basu.ac.ir/bathaeian
In many machine learning algorithms, calculating information entropy is essential. Information entropy is used in decision trees, discretization, data analysis and language modelling, as well as in recent applications such as visualization. JEMLA is a Java package for calculating information entropy in machine learning applications. JEMLA is an open source package under the MIT distribution licence.

JEMLA for users
JEMLA for developers

JEMLA for users
1‐ Install JRE.
2‐ Prepare the input dataset. It can be in ARFF format or a text file with a specification file. For further information, read the Input Format section.
3‐ Make a batch file to run JEMLA. Input arguments are:
   a. Path and name of the input file.
   b. Path and name of the spec file. If the input file is in ARFF format, then just type a ‐‐.
   c. Path and name of the output file. The results of the entropy calculations are written to this file.
   d. Letter “a” followed by a list of indexes of the features for which entropy is calculated. If the word “all” appears after the letter “a”, the entropies of all features are calculated.
   e. Letter “i” followed by the method of handling missing values. Each implemented method has an index:
      • i0: Ignores missing values.
      • i1: For categorical features, replaces missing values with the most common value; for numerical features, replaces missing values with the mean value.
      • i2: Performs method i1 individually within each given class or concept.
      • i3: Finds the values of the feature used in each given class and sorts them by their rate of use. Missing values are replaced with different values, not only with the most common one, so that the rate of a value determines the fraction of missing values that are replaced by that value.
      • i4: Performs like “i2”, except that for numerical features it replaces missing values with the mid value.
      • i5: Performs the “closest fit” algorithm. “Closest fit” is a preprocessing algorithm which, for each instance in the dataset, finds the instance that has the minimum Euclidean distance to it, and then replaces missing values with the matching values of that other instance.
Figure 1 depicts the instructions needed for running JEMLA on a sample.
Figure 1: running JEMLA
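Method i1 in the list above can be sketched in a few lines of Java. This is a minimal illustration and not JEMLA's own code: the class name ImputeSketch, its method names, and the convention of representing a missing value as null are assumptions made for this example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of method i1; not part of the JEMLA API.
class ImputeSketch {

    // Replaces null entries of a categorical column with the most common
    // non-null value. On ties, the first value to reach the top count wins.
    static List<String> fillWithMostCommon(List<String> column) {
        Map<String, Integer> counts = new HashMap<>();
        String mode = null;
        int best = 0;
        for (String v : column) {
            if (v == null) continue;
            int c = counts.merge(v, 1, Integer::sum);
            if (c > best) { best = c; mode = v; }
        }
        List<String> out = new ArrayList<>();
        for (String v : column) out.add(v == null ? mode : v);
        return out;
    }

    // Replaces null entries of a numerical column with the mean of the
    // non-null values. If every value is missing, the column is unchanged.
    static List<Double> fillWithMean(List<Double> column) {
        double sum = 0.0;
        int n = 0;
        for (Double v : column) {
            if (v != null) { sum += v; n++; }
        }
        Double mean = (n == 0) ? null : Double.valueOf(sum / n);
        List<Double> out = new ArrayList<>();
        for (Double v : column) out.add(v == null ? mean : v);
        return out;
    }

    public static void main(String[] args) {
        // Arrays.asList is used because it accepts null elements.
        System.out.println(fillWithMostCommon(Arrays.asList("red", null, "blue", "red")));
        System.out.println(fillWithMean(Arrays.asList(1.0, null, 3.0)));
    }
}
```

Method i2 would apply the same two functions separately to the instances of each class, which in this sketch amounts to grouping the column by class label before calling them.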
For each feature, the entropy is calculated. If the feature is numerical, a threshold value is presented as well. Figure 2 depicts part of the output file produced by running the above instruction. Because all features in that dataset are categorical, the numbers ‐9999999.0 are shown as fake thresholds.
Figure 2: a sample of output file
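To make the entropy and threshold values more concrete, the following sketch computes the Shannon entropy of a class distribution, H = -sum(p_i * log2(p_i)), and searches for a threshold on a numerical feature that minimises the weighted class entropy of the two resulting parts. It is an illustration under stated assumptions, not JEMLA's implementation: the class EntropySketch, its methods, the binary 0/1 class labels and the use of midpoints as candidate thresholds are all hypothetical choices, and JEMLA may define the entropy of a feature and select its threshold differently.

```java
import java.util.Arrays;

// Hypothetical sketch; not part of the JEMLA API.
class EntropySketch {

    // Shannon entropy (in bits) of a distribution given as class counts.
    static double entropy(int[] counts) {
        int total = 0;
        for (int c : counts) total += c;
        if (total == 0) return 0.0;
        double h = 0.0;
        for (int c : counts) {
            if (c == 0) continue;                 // 0 * log(0) is treated as 0
            double p = (double) c / total;
            h -= p * (Math.log(p) / Math.log(2));
        }
        return h;
    }

    // Weighted class entropy after splitting at threshold t
    // (values <= t go left, the rest go right). Labels must be 0 or 1.
    static double splitEntropy(double[] values, int[] labels, double t) {
        int[] left = new int[2];
        int[] right = new int[2];
        for (int i = 0; i < values.length; i++) {
            if (values[i] <= t) left[labels[i]]++; else right[labels[i]]++;
        }
        double n = values.length;
        int nLeft = left[0] + left[1];
        int nRight = right[0] + right[1];
        return (nLeft / n) * entropy(left) + (nRight / n) * entropy(right);
    }

    // Tries the midpoint between each pair of adjacent distinct values and
    // returns the one with the lowest split entropy (NaN if all values are equal).
    static double bestThreshold(double[] values, int[] labels) {
        double[] sorted = values.clone();
        Arrays.sort(sorted);
        double best = Double.NaN;
        double bestH = Double.POSITIVE_INFINITY;
        for (int i = 0; i + 1 < sorted.length; i++) {
            if (sorted[i] == sorted[i + 1]) continue;
            double t = (sorted[i] + sorted[i + 1]) / 2.0;
            double h = splitEntropy(values, labels, t);
            if (h < bestH) { bestH = h; best = t; }
        }
        return best;
    }

    public static void main(String[] args) {
        double[] values = {1.0, 2.0, 8.0, 9.0};
        int[] labels = {0, 0, 1, 1};
        System.out.println(entropy(new int[]{2, 2}));      // two equally likely classes
        System.out.println(bestThreshold(values, labels)); // midpoint that separates the classes
    }
}
```

In this toy data the two classes are perfectly separated by the midpoint between 2.0 and 8.0, so that candidate has a split entropy of zero and is selected.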
JEMLA for developers
JEMLA has an efficient object oriented design. It helps developers to start their machine learning projects easily. It offers a simple way of storing datasets and some algorithms for preprocessing them.
The following diagram shows the class diagram of JEMLA.
Figure 3: class diagram of JEMLA

The following table describes the main classes and their uses briefly. For more information please see the API documents of JEMLA.

Class DataSet
   Data sets are stored and retrieved in this class. The closest‐fit algorithm is implemented in this class.
Class Instance
   Every instance in a data set is stored in this class. The algorithm for finding the Euclidean distance between two instances is implemented in this class.
Class ListOfValues
   Values of individual features can be retrieved through this class.
Class DiscretizedValues
   A supervised algorithm for the discretization of numerical values is implemented in this class. The algorithm is based on entropy.
Class Entropy
   Functions for calculating information entropy as well as mutual information entropy are implemented in this class. In addition, some algorithms for imputing and handling missing values are implemented in this class too.
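
The closest‐fit idea, which combines a Euclidean distance between instances with replacement of missing values, can be sketched as follows. This is a simplified illustration and does not reproduce the JEMLA API: the class ClosestFitSketch, the use of Double.NaN to mark a missing value, the restriction to numerical features, and the choice to measure distance only over features present in both instances are assumptions made for this example.

```java
import java.util.Arrays;

// Hypothetical sketch; not part of the JEMLA API.
class ClosestFitSketch {

    // Euclidean distance over the features that are present (non-NaN) in both
    // instances. Returns positive infinity if the instances share no feature.
    static double distance(double[] a, double[] b) {
        double sum = 0.0;
        int shared = 0;
        for (int j = 0; j < a.length; j++) {
            if (Double.isNaN(a[j]) || Double.isNaN(b[j])) continue;
            double d = a[j] - b[j];
            sum += d * d;
            shared++;
        }
        return shared == 0 ? Double.POSITIVE_INFINITY : Math.sqrt(sum);
    }

    // Returns a copy of the data in which each missing value is taken from the
    // closest other instance that has a value for that feature. A value stays
    // missing if no other instance can supply it.
    static double[][] impute(double[][] data) {
        double[][] out = new double[data.length][];
        for (int i = 0; i < data.length; i++) {
            out[i] = data[i].clone();
            for (int j = 0; j < data[i].length; j++) {
                if (!Double.isNaN(data[i][j])) continue;
                double bestDist = Double.POSITIVE_INFINITY;
                for (int k = 0; k < data.length; k++) {
                    if (k == i || Double.isNaN(data[k][j])) continue;
                    double dist = distance(data[i], data[k]);
                    if (dist < bestDist) {
                        bestDist = dist;
                        out[i][j] = data[k][j];
                    }
                }
            }
        }
        return out;
    }

    public static void main(String[] args) {
        double[][] data = {
            {1.0, 2.0, Double.NaN},   // third feature is missing
            {1.0, 2.5, 7.0},          // close to the first instance
            {9.0, 9.0, 1.0}           // far from the first instance
        };
        System.out.println(Arrays.toString(impute(data)[0]));
    }
}
```

Here the missing value of the first instance is filled from the second instance, which is its nearest neighbour on the two features they share.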