How to use JEMLA?
28 April 2015
JEMLA v 1.0
Narges S. Bathaeian
[email protected]
[email protected]
http://profs.basu.ac.ir/bathaeian

In many machine learning algorithms, calculating information entropy is essential. Information entropy is used in decision trees, discretization, data analysis and language modelling, as well as in recent applications such as visualization. JEMLA is a Java package for calculating information entropy in machine learning applications. JEMLA is an open source package distributed under the MIT licence.

JEMLA for users
JEMLA for developers

JEMLA for users

1‐ Install JRE.
2‐ Prepare the input dataset. It can be in ARFF format or a text file with a specification file. For further information, read the Input Format section.
3‐ Make a batch file to run JEMLA. Input arguments are:
   a. Path and name of the input file.
   b. Path and name of the spec file. If the input file is in ARFF format, just type "‐‐".
   c. Path and name of the output file. The calculated entropies are written to this file.
   d. Letter "a" followed by the list of indexes of the features whose entropy is to be calculated. If the word "all" appears after the letter "a", the entropies of all features are calculated.
   e. Letter "i" followed by the method of handling missing values. Each implemented method has an index:
      i0: Ignores missing values.
      i1: For categorical features, replaces missing values with the most common value; for numerical features, replaces missing values with the mean value.
      i2: Within each given class or concept, performs method i1 individually.
      i3: Finds the values of the feature used in each given class and sorts them by their rate of use. Missing values are replaced with different values, not only with the most common value: the rate of a value determines the fraction of missing values that are replaced by that value.
      i4: Performs like "i2", except that for numerical features it replaces missing values with the mid value.
      i5: Performs the "closest fit" algorithm. "Closest fit" is a preprocessing algorithm which, for each instance in the dataset, finds the instance that has the minimum Euclidean distance to it, and then replaces its missing values with the matching values of that other instance.

Figure 1 depicts the instructions needed for running JEMLA on a sample.

Figure 1: running JEMLA

For each feature, the entropy is calculated. If the feature is numerical, a threshold value is presented as well. Figure 2 depicts part of the output file that results from running the above instructions. Because all features in that dataset are categorical, the numbers ‐9999999.0 are shown as fake thresholds.

Figure 2: a sample of output file

JEMLA for developers

JEMLA has an efficient object oriented design. It helps developers to start their machine learning projects easily. It offers a simple way of storing datasets and some algorithms for preprocessing them. The following diagram shows the class diagram of JEMLA.

Figure 3: class diagram of JEMLA

The following table describes the main classes and their uses briefly. For more information, please see the API documents of JEMLA.

Class DataSet
Data sets are stored and retrieved in this class. The closest‐fit algorithm is implemented in this class.

Class Instance
Every instance in a data set is stored in this class. The algorithm for finding the Euclidean distance between two instances is implemented in this class.

Class ListOfValues
Values of individual features can be retrieved through this class.

Class DiscretizedValues
A supervised algorithm for discretization of numerical values is implemented in this class. The algorithm is based on entropy.

Class Entropy
Functions for calculating information entropy as well as mutual information entropy are implemented in this class. In addition, some algorithms for imputing and handling missing values are implemented in this class too.
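
The two quantities named for the Entropy class can be illustrated with a minimal sketch. This is not JEMLA's code or API: the class EntropySketch and its methods are hypothetical names chosen for illustration. The sketch assumes categorical values given as strings, no missing values, entropy measured in bits, and values that contain no tab character.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration class; not part of JEMLA.
final class EntropySketch {

    // Shannon entropy H(X) = -sum p(x) * log2 p(x), in bits.
    static double entropy(List<String> values) {
        if (values.isEmpty()) {
            return 0.0;
        }
        // Count how often each distinct value occurs.
        Map<String, Integer> counts = new HashMap<>();
        for (String v : values) {
            counts.merge(v, 1, Integer::sum);
        }
        double n = values.size();
        double h = 0.0;
        for (int c : counts.values()) {
            double p = c / n;
            h -= p * Math.log(p) / Math.log(2);
        }
        return h;
    }

    // Mutual information I(X;Y) = H(X) + H(Y) - H(X,Y), in bits.
    static double mutualInformation(List<String> x, List<String> y) {
        if (x.size() != y.size()) {
            throw new IllegalArgumentException("columns must have equal length");
        }
        // Build the joint variable by pairing values row by row.
        // Assumption: no value contains a tab, so the pairing is unambiguous.
        List<String> joint = new ArrayList<>();
        for (int i = 0; i < x.size(); i++) {
            joint.add(x.get(i) + "\t" + y.get(i));
        }
        return entropy(x) + entropy(y) - entropy(joint);
    }

    public static void main(String[] args) {
        List<String> feature = Arrays.asList("sunny", "sunny", "rainy", "rainy");
        List<String> label = Arrays.asList("no", "no", "yes", "yes");
        System.out.println("H(feature)        = " + entropy(feature));
        System.out.println("I(feature; label) = " + mutualInformation(feature, label));
    }
}
```

In this toy data the feature determines the label completely, so the mutual information equals the entropy of the label. For a feature that is independent of the label, the mutual information would be zero.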
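
The missing value method described as i1 (most common value for categorical features, mean value for numerical features) can be sketched as follows. Again this is only an illustration under stated assumptions, not JEMLA's implementation: the class ImputeSketch and its method names are hypothetical, a missing value is represented by null, and ties for the most common value are broken arbitrarily.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration class; not part of JEMLA.
final class ImputeSketch {

    // Categorical feature: replace each null with the most common non-null value.
    static List<String> fillWithMode(List<String> column) {
        Map<String, Integer> counts = new HashMap<>();
        for (String v : column) {
            if (v != null) {
                counts.merge(v, 1, Integer::sum);
            }
        }
        String mode = null;
        int best = 0;
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            if (e.getValue() > best) {   // ties are broken arbitrarily
                best = e.getValue();
                mode = e.getKey();
            }
        }
        List<String> out = new ArrayList<>();
        for (String v : column) {
            out.add(v == null ? mode : v);
        }
        return out;
    }

    // Numerical feature: replace each null with the mean of the non-null values.
    static List<Double> fillWithMean(List<Double> column) {
        double sum = 0.0;
        int n = 0;
        for (Double v : column) {
            if (v != null) {
                sum += v;
                n++;
            }
        }
        // If every value is missing there is no mean; nulls are left in place.
        Double mean = null;
        if (n > 0) {
            mean = sum / n;
        }
        List<Double> out = new ArrayList<>();
        for (Double v : column) {
            out.add(v == null ? mean : v);
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(fillWithMode(Arrays.asList("red", null, "red", "blue")));
        System.out.println(fillWithMean(Arrays.asList(1.0, null, 3.0)));
    }
}
```

Method i2 as described in the text would apply the same two functions separately to the instances of each class, which amounts to grouping the column by class label before calling them.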