Commit 97f372b

committed
update docs
1 parent b687120 commit 97f372b

3 files changed

+177
-68
lines changed

graph.html

Lines changed: 1 addition & 1 deletion
@@ -236,7 +236,7 @@ <h1 id="graph-top" class="title">Graph Data Structure</h1>
236236
<div class="tab-pane active" id="java_2">
237237
<div class="code" style="text-align: left;">
238238
<pre class="prettyprint lang-java"><code>
239-
import smile.graph.*
239+
import smile.graph.*;
240240

241241
var graph = new AdjacencyList(8);
242242
graph.addEdge(0, 2);

missing-value-imputation.html

Lines changed: 47 additions & 49 deletions
@@ -268,32 +268,36 @@ <h1 id="missing-value-imputation-top" class="title">Missing Value Imputation</h1
268268
better performance in cases where the missing data is structurally absent,
269269
rather than missing due to measurement noise.</p>
270270

271-
<p>Smile provides several methods to impute missing values. The <code>NaN</code> values
272-
in the input data matrix are treated as missing values and will be replaced with imputed
273-
values after the processing.</p>
271+
<p>Smile provides several methods to impute missing values. The <code>null</code>
272+
values in a DataFrame or <code>NaN</code> values in a matrix are treated
273+
as missing values and can be handled by the following mechanisms.</p>
274274

275-
<h2 id="average">Average Value Imputation</h2>
275+
<h2 id="simple">SimpleImputer</h2>
276276

277-
<p>In this approach, we impute missing values with the average of other attributes in the instance.
278-
Assume the attributes of the dataset are of same kind, e.g. microarray gene
279-
expression data, the missing values can be estimated as the average of
280-
non-missing attributes in the same instance. Note that this is not the
281-
average of same attribute across different instances.</p>
277+
<p>The <code>SimpleImputer</code> replaces missing values with a constant value
278+
along each column. By default, SimpleImputer imputes all the numeric
279+
columns with the median, boolean/nominal columns with the mode, and text
280+
columns with the empty string. It is also possible to impute the numeric
281+
columns with the mean of values in the range <code>[lower, upper]</code>,
282+
where lower and upper are in terms of percentiles of the original distribution.</p>
282283

283284
<ul class="nav nav-tabs">
284-
<li class="active"><a href="#scala_1" data-toggle="tab">Scala</a></li>
285+
<li class="active"><a href="#java_1" data-toggle="tab">Java</a></li>
285286
</ul>
286287
<div class="tab-content">
287-
<div class="tab-pane active" id="scala_1">
288+
<div class="tab-pane active" id="java_1">
288289
<div class="code" style="text-align: left;">
289-
<pre class="prettyprint lang-scala"><code>
290-
def avgimpute(data: Array[Array[Double]]): Unit
290+
<pre class="prettyprint lang-java"><code>
291+
var format = CSVFormat.Builder.create().setDelimiter(' ').build();
292+
var data = Read.csv("data/clustering/synthetic_control.data", format);
293+
SimpleImputer imputer = SimpleImputer.fit(data);
294+
var completeData = imputer.apply(data);
291295
</code></pre>
292296
</div>
293297
</div>
294298
</div>
295299

296-
<h2 id="knn">K-Nearest Neighbor Imputation</h2>
300+
<h2 id="knn">KNNImputer</h2>
297301

298302
<p>The KNN-based method selects instances similar to the instance of interest to impute
299303
missing values. If we consider instance <code>A</code> that has one missing value on
@@ -304,51 +308,40 @@ <h2 id="knn">K-Nearest Neighbor Imputation</h2>
304308
neighbors is then used as an estimate for the missing value in instance <code>A</code>.</p>
305309

306310
<ul class="nav nav-tabs">
307-
<li class="active"><a href="#scala_2" data-toggle="tab">Scala</a></li>
311+
<li class="active"><a href="#java_2" data-toggle="tab">Java</a></li>
308312
</ul>
309313
<div class="tab-content">
310-
<div class="tab-pane active" id="scala_2">
314+
<div class="tab-pane active" id="java_2">
311315
<div class="code" style="text-align: left;">
312-
<pre class="prettyprint lang-scala"><code>
313-
def knnimpute(data: Array[Array[Double]], k: Int = 5)
316+
<pre class="prettyprint lang-java"><code>
317+
// KNN is a lazy algorithm, so no "training" is needed.
318+
var imputer = new KNNImputer(data, 5);
314319
</code></pre>
315320
</div>
316321
</div>
317322
</div>
318323

319-
<h2 id="kmeans">K-Means Imputation</h2>
324+
<h2 id="kmeans">KMedoidsImputer</h2>
320325

321-
<p>This method first cluster data by K-Means
322-
with missing values and then impute missing values with the average value of each attribute
323-
in the clusters.</p>
326+
<p>The k-medoids algorithm is an adaptation of the k-means algorithm.
327+
Rather than calculate the mean of the items in each cluster,
328+
a representative item, or medoid, is chosen for each cluster
329+
at each iteration. The missing values of an instance are replaced
330+
by the corresponding values of the nearest medoid.</p>
324331

325332
<ul class="nav nav-tabs">
326-
<li class="active"><a href="#scala_3" data-toggle="tab">Scala</a></li>
333+
<li class="active"><a href="#java_3" data-toggle="tab">Java</a></li>
327334
</ul>
328335
<div class="tab-content">
329-
<div class="tab-pane active" id="scala_3">
336+
<div class="tab-pane active" id="java_3">
330337
<div class="code" style="text-align: left;">
331-
<pre class="prettyprint lang-scala"><code>
332-
def impute(data: Array[Array[Double]], k: Int, runs: Int = 1): Unit
333-
</code></pre>
334-
</div>
335-
</div>
336-
</div>
337-
338-
<h2 id="lls">Local Least Squares Imputation</h2>
339-
340-
<p>The local least squares imputation method represents a target instance that has missing values as
341-
a linear combination of similar instances, which are selected by k-nearest
342-
neighbors method.</p>
343-
344-
<ul class="nav nav-tabs">
345-
<li class="active"><a href="#scala_4" data-toggle="tab">Scala</a></li>
346-
</ul>
347-
<div class="tab-content">
348-
<div class="tab-pane active" id="scala_4">
349-
<div class="code" style="text-align: left;">
350-
<pre class="prettyprint lang-scala"><code>
351-
def llsimpute(data: Array[Array[Double]], k: Int): Unit
338+
<pre class="prettyprint lang-java"><code>
339+
Distance&lt;Tuple&gt; distance = (x, y) -> {
340+
double[] xd = x.toArray();
341+
double[] yd = y.toArray();
342+
return MathEx.squaredDistanceWithMissingValues(xd, yd);
343+
};
344+
var imputer = KMedoidsImputer.fit(data, distance, 20);
352345
</code></pre>
353346
</div>
354347
</div>
@@ -374,14 +367,19 @@ <h2 id="svd">SVD Imputation</h2>
374367
obtained matrix, until the total change in the matrix falls below the
375368
empirically determined threshold (say 0.01).</p>
376369

370+
<p>Unlike the methods above, <code>SVDImputer</code> is applied to a <code>double[][]</code>
371+
matrix, where missing values are represented as <code>NaN</code>. The output is also
372+
a <code>double[][]</code> matrix with imputed values.</p>
373+
377374
<ul class="nav nav-tabs">
378-
<li class="active"><a href="#scala_5" data-toggle="tab">Scala</a></li>
375+
<li class="active"><a href="#java_5" data-toggle="tab">Java</a></li>
379376
</ul>
380377
<div class="tab-content">
381-
<div class="tab-pane active" id="scala_5">
378+
<div class="tab-pane active" id="java_5">
382379
<div class="code" style="text-align: left;">
383-
<pre class="prettyprint lang-scala"><code>
384-
def svdimpute(data: Array[Array[Double]], k: Int, maxIter: Int = 10)): Unit
380+
<pre class="prettyprint lang-java"><code>
381+
var matrix = data.toArray();
382+
double[][] completeMatrix = SVDImputer.impute(matrix, 5, 10);
385383
</code></pre>
386384
</div>
387385
</div>
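
Note on the KNN description in this diff: the idea of averaging the values of the nearest neighbors can be illustrated without the library. The following is a minimal plain-Java sketch, not Smile's `KNNImputer` implementation. The class name `KnnImputeSketch` and both methods are hypothetical, and the distance (mean squared difference over the coordinates observed in both rows) is an assumption chosen for illustration.

```java
import java.util.ArrayList;
import java.util.List;

public class KnnImputeSketch {
    /**
     * Mean squared difference over the coordinates observed in both rows.
     * Returns NaN when the two rows share no observed coordinate.
     * (Assumed distance for this sketch; the library may use another.)
     */
    static double distance(double[] x, double[] y) {
        double sum = 0.0;
        int n = 0;
        for (int j = 0; j < x.length; j++) {
            if (!Double.isNaN(x[j]) && !Double.isNaN(y[j])) {
                double d = x[j] - y[j];
                sum += d * d;
                n++;
            }
        }
        return n == 0 ? Double.NaN : sum / n;
    }

    /**
     * Returns a copy of data in which each NaN entry is replaced by the mean
     * of that column over the k nearest rows that have the column observed.
     * Entries with no usable neighbor stay NaN. The input is not modified.
     */
    static double[][] impute(double[][] data, int k) {
        double[][] out = new double[data.length][];
        for (int i = 0; i < data.length; i++) {
            out[i] = data[i].clone();
            for (int j = 0; j < data[i].length; j++) {
                if (!Double.isNaN(data[i][j])) continue;

                // Collect {distance, value} pairs from rows that observe column j.
                List<double[]> candidates = new ArrayList<>();
                for (int r = 0; r < data.length; r++) {
                    if (r == i || Double.isNaN(data[r][j])) continue;
                    double d = distance(data[i], data[r]);
                    if (!Double.isNaN(d)) {
                        candidates.add(new double[] {d, data[r][j]});
                    }
                }
                candidates.sort((a, b) -> Double.compare(a[0], b[0]));

                int m = Math.min(k, candidates.size());
                if (m == 0) continue; // no neighbor available, leave NaN

                double sum = 0.0;
                for (int c = 0; c < m; c++) {
                    sum += candidates.get(c)[1];
                }
                out[i][j] = sum / m;
            }
        }
        return out;
    }
}
```

Calling `KnnImputeSketch.impute(matrix, 5)` on a `double[][]` with `NaN` entries returns a new matrix and leaves the input untouched. Distances are computed from the original data rather than from already imputed values, so the result does not depend on the order in which rows are processed.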
