ML notebook

SANSA-ML is the machine learning (ML) library of the SANSA stack. The algorithms in this repository perform machine learning tasks directly on RDF/OWL input data. While most machine learning algorithms operate on simple feature vectors, the algorithms in SANSA-ML exploit the graph structure and semantics of background knowledge expressed in the RDF and OWL standards. In many cases, this yields results that are more accurate or more human-understandable. In contrast to most other algorithms that support background knowledge, they scale horizontally using Apache Spark and Apache Flink.
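
The snippets below are written for a Spark notebook environment (e.g. Apache Zeppelin), where spark (a SparkSession) and sc (its SparkContext) are already defined. If you run them as a standalone application instead, you have to create the session yourself; a minimal sketch, assuming local mode and SANSA on the classpath:

import org.apache.spark.sql.SparkSession

// Hypothetical standalone setup; in a notebook, `spark` and `sc` are predefined.
val spark = SparkSession.builder()
  .appName("SANSA-ML examples")
  .master("local[*]") // assumption: local mode for experimentation
  .getOrCreate()
val sc = spark.sparkContext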

RDF By Modularity Clustering example

import net.sansa_stack.ml.spark.clustering.RDFByModularityClustering

// Input RDF graph in N-Triples format and the target file for the cluster output
val graphFile = "hdfs://namenode:8020/data/Clustering_sampledata.nt"
val outputFile = "hdfs://namenode:8020/data/clustering.out"

// Number of iterations of the clustering algorithm
val numIterations = 10

RDFByModularityClustering(sc, numIterations, graphFile, outputFile)
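
To eyeball the result, you can read the output back in the notebook. This assumes the algorithm writes its cluster assignments as plain text to outputFile:

// Assumption: plain-text cluster output at outputFile; print the first few lines
sc.textFile(outputFile).take(10).foreach(println)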

Rule Mining example

import net.sansa_stack.ml.spark.mining.amieSpark.KBObject.KB
import net.sansa_stack.ml.spark.mining.amieSpark.{ RDFGraphLoader, DfLoader }
import net.sansa_stack.ml.spark.mining.amieSpark.MineRules.Algorithm

val input = "hdfs://namenode:8020/data/MineRules_sampledata.tsv"
val outputPath = "hdfs://namenode:8020/output"
val hdfsPath = outputPath + "/"

// Set up the knowledge base: where to write intermediate results and where the input lives
val know = new KB()
know.sethdfsPath(hdfsPath)
know.setKbSrc(input)

// Load the input both as an RDF graph and as a DataFrame (2 partitions here)
know.setKbGraph(RDFGraphLoader.loadFromFile(know.getKbSrc(), spark.sparkContext, 2))
know.setDFTable(DfLoader.loadFromFileDF(know.getKbSrc, spark.sparkContext, spark.sqlContext, 2))

// Configure the mining algorithm; by AMIE convention the numeric thresholds are
// presumably minimum head coverage (0.01), maximum rule length (3) and minimum confidence (0.1)
val algo = new Algorithm(know, 0.01, 3, 0.1, hdfsPath)

// Mine the rules; returns a collection of rule containers
val output = algo.ruleMining(spark.sparkContext, spark.sqlContext)
// Render each mined rule as "head <= bodyAtom1 ∧ bodyAtom2 ∧ ..."
val outString = output.map { rule =>
  val atoms = rule.getRule()
  atoms.head + " <= " + atoms.tail.mkString(" \u2227 ")
}.toSeq
  
// Collect the formatted rules into a single partition and write them out as text
val rddOut = spark.sparkContext.parallelize(outString).repartition(1)

rddOut.saveAsTextFile(outputPath + "/testOut")
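
For a quick sanity check before (or after) writing, you can print a few of the formatted rules on the driver; this is plain Spark, nothing SANSA-specific:

// Print the first few mined rules
rddOut.take(5).foreach(println)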