@@ -268,32 +268,36 @@ <h1 id="missing-value-imputation-top" class="title">Missing Value Imputation</h1
         better performance in cases where the missing data is structurally absent,
         rather than missing due to measurement noise.</p>
 
-    <p>Smile provides several methods to impute missing values. The <code>NaN</code> values
-        in the input data matrix are treated as missing values and will be replaced with imputed
-        values after the processing.</p>
+    <p>Smile provides several methods to impute missing values. The <code>null</code>
+        values in a DataFrame or <code>NaN</code> values in a matrix are treated
+        as missing values and can be handled by the following mechanisms.</p>
 
-    <h2 id="average">Average Value Imputation</h2>
+    <h2 id="simple">SimpleImputer</h2>
 
-    <p>In this approach, we impute missing values with the average of other attributes in the instance.
-        Assume the attributes of the dataset are of same kind, e.g. microarray gene
-        expression data, the missing values can be estimated as the average of
-        non-missing attributes in the same instance. Note that this is not the
-        average of same attribute across different instances.</p>
+    <p>The <code>SimpleImputer</code> replaces missing values with a constant value
+        along each column. By default, SimpleImputer imputes all the numeric
+        columns with the median, boolean/nominal columns with the mode, and text
+        columns with the empty string. It is also possible to impute the numeric
+        columns with the mean of the values in the range <code>[lower, upper]</code>,
+        where lower and upper are given as percentiles of the original distribution.</p>
 
     <ul class="nav nav-tabs">
-        <li class="active"><a href="#scala_1" data-toggle="tab">Scala</a></li>
+        <li class="active"><a href="#java_1" data-toggle="tab">Java</a></li>
     </ul>
     <div class="tab-content">
-        <div class="tab-pane active" id="scala_1">
+        <div class="tab-pane active" id="java_1">
             <div class="code" style="text-align: left;">
-                <pre class="prettyprint lang-scala"><code>
-    def avgimpute(data: Array[Array[Double]]): Unit
+                <pre class="prettyprint lang-java"><code>
+    var format = CSVFormat.Builder.create().setDelimiter(' ').build();
+    var data = Read.csv("data/clustering/synthetic_control.data", format);
+    SimpleImputer imputer = SimpleImputer.fit(data);
+    var completeData = imputer.apply(data);
                </code></pre>
            </div>
        </div>
    </div>
 
-    <h2 id="knn">K-Nearest Neighbor Imputation</h2>
+    <h2 id="knn">KNNImputer</h2>
 
     <p>The KNN-based method selects instances similar to the instance of interest to impute
         missing values. If we consider instance <code>A</code> that has one missing value on
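The default numeric strategy described above (fill each column with its median) can be written out by hand. The sketch below is an illustration of that idea only, with hypothetical names (`MedianImputeSketch`), not Smile's `SimpleImputer` implementation:

```java
import java.util.Arrays;

public class MedianImputeSketch {
    // Replace each NaN in a numeric matrix with the median of the
    // observed (non-NaN) values in the same column.
    static double[][] impute(double[][] x) {
        int m = x.length, n = x[0].length;
        double[][] a = new double[m][];
        for (int i = 0; i < m; i++) a[i] = x[i].clone();
        for (int j = 0; j < n; j++) {
            // Collect the observed values of column j.
            double[] observed = new double[m];
            int cnt = 0;
            for (int i = 0; i < m; i++)
                if (!Double.isNaN(x[i][j])) observed[cnt++] = x[i][j];
            double[] col = Arrays.copyOf(observed, cnt);
            Arrays.sort(col);
            // Median of the observed values.
            double median = (cnt % 2 == 1) ? col[cnt / 2]
                    : (col[cnt / 2 - 1] + col[cnt / 2]) / 2;
            // Fill only the missing cells; observed cells are untouched.
            for (int i = 0; i < m; i++)
                if (Double.isNaN(a[i][j])) a[i][j] = median;
        }
        return a;
    }
}
```

The mode for boolean/nominal columns and the empty string for text columns follow the same per-column pattern.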
@@ -304,51 +308,40 @@ <h2 id="knn">K-Nearest Neighbor Imputation</h2>
         neighbors is then used as an estimate for the missing value in instance <code>A</code>.</p>
 
     <ul class="nav nav-tabs">
-        <li class="active"><a href="#scala_2" data-toggle="tab">Scala</a></li>
+        <li class="active"><a href="#java_2" data-toggle="tab">Java</a></li>
    </ul>
    <div class="tab-content">
-        <div class="tab-pane active" id="scala_2">
+        <div class="tab-pane active" id="java_2">
            <div class="code" style="text-align: left;">
-                <pre class="prettyprint lang-scala"><code>
-    def knnimpute(data: Array[Array[Double]], k: Int = 5)
+                <pre class="prettyprint lang-java"><code>
+    // KNN is a lazy algorithm, so no "training" is needed.
+    var imputer = new KNNImputer(data, 5);
                </code></pre>
            </div>
        </div>
    </div>
 
-    <h2 id="kmeans">K-Means Imputation</h2>
+    <h2 id="kmeans">KMedoidsImputer</h2>
 
-    <p>This method first cluster data by K-Means
-        with missing values and then impute missing values with the average value of each attribute
-        in the clusters.</p>
+    <p>The k-medoids algorithm is an adaptation of the k-means algorithm.
+        Rather than calculating the mean of the items in each cluster,
+        a representative item, or medoid, is chosen for each cluster
+        at each iteration. The missing values of an instance are replaced
+        with the corresponding values of the nearest medoid.</p>
 
     <ul class="nav nav-tabs">
-        <li class="active"><a href="#scala_3" data-toggle="tab">Scala</a></li>
+        <li class="active"><a href="#java_3" data-toggle="tab">Java</a></li>
    </ul>
    <div class="tab-content">
-        <div class="tab-pane active" id="scala_3">
+        <div class="tab-pane active" id="java_3">
            <div class="code" style="text-align: left;">
-                <pre class="prettyprint lang-scala"><code>
-    def impute(data: Array[Array[Double]], k: Int, runs: Int = 1): Unit
-                </code></pre>
-            </div>
-        </div>
-    </div>
-
-    <h2 id="lls">Local Least Squares Imputation</h2>
-
-    <p>The local least squares imputation method represents a target instance that has missing values as
-        a linear combination of similar instances, which are selected by k-nearest
-        neighbors method.</p>
-
-    <ul class="nav nav-tabs">
-        <li class="active"><a href="#scala_4" data-toggle="tab">Scala</a></li>
-    </ul>
-    <div class="tab-content">
-        <div class="tab-pane active" id="scala_4">
-            <div class="code" style="text-align: left;">
-                <pre class="prettyprint lang-scala"><code>
-    def llsimpute(data: Array[Array[Double]], k: Int): Unit
+                <pre class="prettyprint lang-java"><code>
+    Distance&lt;Tuple&gt; distance = (x, y) -> {
+        double[] xd = x.toArray();
+        double[] yd = y.toArray();
+        return MathEx.squaredDistanceWithMissingValues(xd, yd);
+    };
+    var imputer = KMedoidsImputer.fit(data, distance, 20);
                </code></pre>
            </div>
        </div>
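Both the KNN and the k-medoids imputers depend on a distance that tolerates missing coordinates. A plausible standalone sketch of that idea follows; the `dim/observed` rescaling is an assumption about how `MathEx.squaredDistanceWithMissingValues` behaves, not its verified code, and the class name is hypothetical:

```java
public class MissingDistanceSketch {
    // Squared Euclidean distance that skips any coordinate where either
    // value is NaN, then scales the sum back up to the full dimensionality
    // so rows with more missing values are not artificially favored.
    static double squaredDistance(double[] x, double[] y) {
        double sum = 0;
        int observed = 0;
        for (int i = 0; i < x.length; i++) {
            if (!Double.isNaN(x[i]) && !Double.isNaN(y[i])) {
                double d = x[i] - y[i];
                sum += d * d;
                observed++;
            }
        }
        // If no coordinate pair is observed, the distance is undefined.
        return observed == 0 ? Double.NaN : sum * x.length / observed;
    }
}
```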
@@ -374,14 +367,19 @@ <h2 id="svd">SVD Imputation</h2>
         obtained matrix, until the total change in the matrix falls below the
         empirically determined threshold (say 0.01).</p>
 
+    <p>Different from the above methods, <code>SVDImputer</code> is applied to a <code>double[][]</code>
+        matrix, where missing values are represented as <code>NaN</code>. The output is also
+        a <code>double[][]</code> matrix with imputed values.</p>
+
     <ul class="nav nav-tabs">
-        <li class="active"><a href="#scala_5" data-toggle="tab">Scala</a></li>
+        <li class="active"><a href="#java_5" data-toggle="tab">Java</a></li>
    </ul>
    <div class="tab-content">
-        <div class="tab-pane active" id="scala_5">
+        <div class="tab-pane active" id="java_5">
            <div class="code" style="text-align: left;">
-                <pre class="prettyprint lang-scala"><code>
-    def svdimpute(data: Array[Array[Double]], k: Int, maxIter: Int = 10)): Unit
+                <pre class="prettyprint lang-java"><code>
+    var matrix = data.toArray();
+    double[][] completeMatrix = SVDImputer.impute(matrix, 5, 10);
                </code></pre>
            </div>
        </div>
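The iterative scheme described for SVD imputation can be sketched end to end. The helper below is hypothetical and, for brevity, uses a rank-1 approximation computed by power iteration rather than the k most significant eigenvectors that `SVDImputer` uses; the structure (initialize missing cells, refit a low-rank approximation, overwrite only the missing cells, repeat until the change falls below a threshold) is the same:

```java
import java.util.Arrays;

public class SvdImputeSketch {
    // EM-style low-rank imputation, rank 1 for brevity (illustrative only).
    static double[][] impute(double[][] x, int maxIter, double tol) {
        int m = x.length, n = x[0].length;
        boolean[][] missing = new boolean[m][n];
        double[][] a = new double[m][n];
        // Initialize each missing cell with its column mean.
        for (int j = 0; j < n; j++) {
            double sum = 0;
            int cnt = 0;
            for (int i = 0; i < m; i++) {
                missing[i][j] = Double.isNaN(x[i][j]);
                if (!missing[i][j]) { sum += x[i][j]; cnt++; }
            }
            for (int i = 0; i < m; i++) a[i][j] = missing[i][j] ? sum / cnt : x[i][j];
        }
        for (int iter = 0; iter < maxIter; iter++) {
            // Power iteration on A'A for the leading right singular vector v.
            double[] v = new double[n];
            Arrays.fill(v, 1.0 / Math.sqrt(n));
            for (int p = 0; p < 50; p++) {
                double[] w = av(transpose(a), av(a, v)); // w = A'(A v)
                double norm = 0;
                for (double wj : w) norm += wj * wj;
                norm = Math.sqrt(norm);
                for (int j = 0; j < n; j++) v[j] = w[j] / norm;
            }
            double[] u = av(a, v); // u = A v = sigma * (left singular vector)
            // Overwrite only the missing cells with the rank-1 fit u v'.
            double change = 0;
            for (int i = 0; i < m; i++) {
                for (int j = 0; j < n; j++) {
                    if (missing[i][j]) {
                        double value = u[i] * v[j];
                        change = Math.max(change, Math.abs(value - a[i][j]));
                        a[i][j] = value;
                    }
                }
            }
            if (change < tol) break; // imputed cells stopped moving
        }
        return a;
    }

    static double[] av(double[][] a, double[] x) {
        double[] y = new double[a.length];
        for (int i = 0; i < a.length; i++)
            for (int j = 0; j < x.length; j++)
                y[i] += a[i][j] * x[j];
        return y;
    }

    static double[][] transpose(double[][] a) {
        double[][] t = new double[a[0].length][a.length];
        for (int i = 0; i < a.length; i++)
            for (int j = 0; j < a[0].length; j++)
                t[j][i] = a[i][j];
        return t;
    }
}
```

On data that is close to low-rank, the fixed point of this loop is a completion consistent with the leading singular structure of the observed entries.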