Merge pull request #11 from bab2min/develop
Develop
bab2min authored Nov 5, 2019
2 parents 0c6cb80 + a38eff3 commit 17a90aa
Showing 60 changed files with 1,395 additions and 333 deletions.
7 changes: 6 additions & 1 deletion README.kr.rst
@@ -28,7 +28,7 @@ What is tomotopy?

Please visit https://bab2min.github.io/tomotopy/index.kr.html for more information.

The latest version of tomotopy is 0.3.0.
The latest version of tomotopy is 0.3.1.

Getting Started
---------------
@@ -195,6 +195,11 @@ Python3 example code for tomotopy is at https://github.com/bab2min/tomotopy/blob/ma

History
-------
* 0.3.1 (2019-11-05)
  * Fixed an issue where `get_topic_dist()` returned incorrect values when `min_cf` or `rm_top` was set.
  * `get_topic_dist()` of `tomotopy.MGLDAModel` documents was fixed to also return distributions over local topics.
  * Improved training speed when `tw=ONE`.

* 0.3.0 (2019-10-06)
  * A new topic model, `tomotopy.LLDAModel`, was added.
  * Fixed an issue where the program crashed while training `HDPModel`.
7 changes: 6 additions & 1 deletion README.rst
@@ -29,7 +29,7 @@ The current version of `tomoto` supports several major topic models including

Please visit https://bab2min.github.io/tomotopy to see more information.

The most recent version of tomotopy is 0.3.0.
The most recent version of tomotopy is 0.3.1.

Getting Started
---------------
@@ -200,6 +200,11 @@ meaning you can use it for any reasonable purpose and remain in complete ownersh

History
-------
* 0.3.1 (2019-11-05)
* Fixed an issue where `get_topic_dist()` returned incorrect values when `min_cf` or `rm_top` was set.
* The return value of `get_topic_dist()` of `tomotopy.MGLDAModel` documents was fixed to include local topics.
* The estimation speed with `tw=ONE` was improved.

* 0.3.0 (2019-10-06)
* A new model, `tomotopy.LLDAModel`, was added to the package.
* A crashing issue of `HDPModel` was fixed.
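The MGLDA fix noted in the changelog above makes `get_topic_dist()` normalize over global and local topics together. A minimal sketch of that idea in plain Python (the counts and `alpha` smoothing here are hypothetical and illustrative, not the actual tomotopy internals):

```python
def topic_dist(global_counts, local_counts, alpha=0.1):
    # Concatenate global and local topic counts so the returned
    # distribution covers both, then smooth and normalize.
    counts = list(global_counts) + list(local_counts)
    smoothed = [c + alpha for c in counts]
    total = sum(smoothed)
    return [s / total for s in smoothed]

dist = topic_dist([3, 1, 0], [2, 0], alpha=0.1)
assert len(dist) == 5                    # global + local topics both included
assert abs(sum(dist) - 1.0) < 1e-9      # proper probability distribution
```

Before the fix, a distribution computed over global topics only would sum the wrong denominator once local topics carry mass.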
6 changes: 3 additions & 3 deletions setup.py
@@ -24,8 +24,8 @@
largs += ['-stdlib=libc++']
arch_levels = {'':'-march=native'}
elif 'manylinux' in os.environ.get('AUDITWHEEL_PLAT', ''):
cargs = ['-std=c++0x', '-O3', '-fpermissive']
arch_levels = {'':'', 'sse2':'-msse2', 'avx':'-mavx'}
cargs = ['-std=c++0x', '-O3', '-fpermissive', '-g0']
arch_levels = {'':'', 'sse2':'-msse2', 'avx':'-mavx', 'avx2':'-mavx2'}
else:
cargs = ['-std=c++0x', '-O3', '-fpermissive']
arch_levels = {'':'-march=native'}
@@ -47,7 +47,7 @@
setup(
name='tomotopy',

version='0.3.0',
version='0.3.1',

description='Tomoto, The Topic Modeling Tool for Python',
long_description=long_description,
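The setup.py change adds an `avx2` variant to the manylinux wheel builds (and `-g0` to strip debug info). A package that ships several instruction-set builds typically selects the most capable one the host CPU supports at import time; a sketch of that selection logic with illustrative names (tomotopy's actual loader may differ):

```python
def pick_arch(available, supported):
    """Return the highest arch suffix from `available` the CPU supports.

    `available` lists arch suffixes built into the wheel, ordered from
    baseline ('' = generic) to most capable; `supported` is the set of
    CPU feature flags detected at runtime.
    """
    best = ''
    for arch in available:
        if arch == '' or arch in supported:
            best = arch  # keep upgrading while the CPU supports it
    return best

# A CPU with SSE2 and AVX but no AVX2 gets the 'avx' build.
assert pick_arch(['', 'sse2', 'avx', 'avx2'], {'sse2', 'avx'}) == 'avx'
```

This mirrors the `arch_levels` dict in the diff: each key maps to a compiler flag (`-msse2`, `-mavx`, `-mavx2`), and the empty key is the portable fallback.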
10 changes: 5 additions & 5 deletions src/TopicModel/CT.h
@@ -3,13 +3,13 @@

namespace tomoto
{
template<TermWeight _TW, bool _Shared = false>
struct DocumentCTM : public DocumentLDA<_TW, _Shared>
template<TermWeight _TW, size_t _Flags = 0>
struct DocumentCTM : public DocumentLDA<_TW, _Flags>
{
using DocumentLDA<_TW, _Shared>::DocumentLDA;
using DocumentLDA<_TW, _Flags>::DocumentLDA;
Eigen::Matrix<FLOAT, -1, -1> beta; // Dim: (K, betaSample)
Eigen::Matrix<FLOAT, -1, 1> smBeta; // Dim: K
DEFINE_SERIALIZER_AFTER_BASE2(DocumentLDA<_TW, _Shared>, smBeta);
DEFINE_SERIALIZER_AFTER_BASE2(DocumentLDA<_TW, _Flags>, smBeta);
};

class ICTModel : public ILDAModel
@@ -18,7 +18,7 @@ namespace tomoto
using DefaultDocType = DocumentCTM<TermWeight::one>;
static ICTModel* create(TermWeight _weight, size_t _K = 1,
FLOAT smoothingAlpha = 0.1, FLOAT _eta = 0.01,
const RANDGEN& _rg = RANDGEN{ std::random_device{}() });
const RandGen& _rg = RandGen{ std::random_device{}() });

virtual void setNumBetaSample(size_t numSample) = 0;
virtual size_t getNumBetaSample() const = 0;
2 changes: 1 addition & 1 deletion src/TopicModel/CTModel.cpp
@@ -6,7 +6,7 @@ namespace tomoto
template class CTModel<TermWeight::idf>;
template class CTModel<TermWeight::pmi>;

ICTModel* ICTModel::create(TermWeight _weight, size_t _K, FLOAT smoothingAlpha, FLOAT _eta, const RANDGEN& _rg)
ICTModel* ICTModel::create(TermWeight _weight, size_t _K, FLOAT smoothingAlpha, FLOAT _eta, const RandGen& _rg)
{
SWITCH_TW(_weight, CTModel, _K, smoothingAlpha, _eta, _rg);
}
29 changes: 13 additions & 16 deletions src/TopicModel/CTModel.hpp
@@ -16,19 +16,19 @@ namespace tomoto
{
};

template<TermWeight _TW, bool _Shared = false,
template<TermWeight _TW, size_t _Flags = 0,
typename _Interface = ICTModel,
typename _Derived = void,
typename _DocType = DocumentCTM<_TW>,
typename _ModelState = ModelStateCTM<_TW>>
class CTModel : public LDAModel<_TW, _Shared, _Interface,
typename std::conditional<std::is_same<_Derived, void>::value, CTModel<_TW, _Shared>, _Derived>::type,
class CTModel : public LDAModel<_TW, _Flags, _Interface,
typename std::conditional<std::is_same<_Derived, void>::value, CTModel<_TW, _Flags>, _Derived>::type,
_DocType, _ModelState>
{
static constexpr const char* TMID = "CTM";
protected:
using DerivedClass = typename std::conditional<std::is_same<_Derived, void>::value, CTModel<_TW>, _Derived>::type;
using BaseClass = LDAModel<_TW, _Shared, _Interface, DerivedClass, _DocType, _ModelState>;
using BaseClass = LDAModel<_TW, _Flags, _Interface, DerivedClass, _DocType, _ModelState>;
friend BaseClass;
friend typename BaseClass::BaseClass;
using WeightType = typename BaseClass::WeightType;
@@ -50,7 +50,7 @@
return &zLikelihood[0];
}

void updateBeta(_DocType& doc, RANDGEN& rg) const
void updateBeta(_DocType& doc, RandGen& rg) const
{
Eigen::Matrix<FLOAT, -1, 1> pbeta, lowerBound, upperBound;
constexpr FLOAT epsilon = 1e-8;
@@ -65,7 +65,7 @@
for (size_t k = 0; k < this->K; ++k)
{
FLOAT N_k = doc.numByTopic[k] + this->alpha;
FLOAT N_nk = doc.template getSumWordWeight<_TW>() + this->alpha * (this->K + 1) - N_k;
FLOAT N_nk = doc.getSumWordWeight() + this->alpha * (this->K + 1) - N_k;
FLOAT u1 = std::generate_canonical<FLOAT, 32>(rg), u2 = std::generate_canonical<FLOAT, 32>(rg);
FLOAT max_uk = epsilon + pow(u1, (FLOAT)1 / N_k) * (pbeta[k] - epsilon);
FLOAT min_unk = (1 - pow(u2, (FLOAT)1 / N_nk))
@@ -104,7 +104,7 @@
doc.smBeta /= doc.smBeta.array().sum();
}
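The `updateBeta` code above draws the maximum of `N_k` i.i.d. uniform variates in a single step via the `pow(u1, 1 / N_k)` terms: the maximum of `n` draws from Uniform(lo, hi) has CDF `((x - lo) / (hi - lo))^n`, so it can be sampled directly by inverse transform. A small Python sketch of that trick:

```python
import random

def sample_max_uniform(n, lo=0.0, hi=1.0):
    # Inverse-CDF sampling: if U ~ Uniform(0, 1), then
    # lo + U**(1/n) * (hi - lo) is distributed as the maximum
    # of n independent Uniform(lo, hi) draws.
    u = random.random()
    return lo + u ** (1.0 / n) * (hi - lo)

x = sample_max_uniform(10, 0.0, 0.5)
assert 0.0 <= x <= 0.5  # always stays inside the interval
```

This avoids drawing and comparing `N_k` samples per topic, which matters since the sampler runs inside the per-document beta update loop.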

void sampleDocument(_DocType& doc, size_t docId, _ModelState& ld, RANDGEN& rgs, size_t iterationCnt) const
void sampleDocument(_DocType& doc, size_t docId, _ModelState& ld, RandGen& rgs, size_t iterationCnt) const
{
BaseClass::sampleDocument(doc, docId, ld, rgs, iterationCnt);
if (iterationCnt >= this->burnIn && this->optimInterval && (iterationCnt + 1) % this->optimInterval == 0)
@@ -113,7 +113,7 @@
}
}

int restoreFromTrainingError(const exception::TrainingError& e, ThreadPool& pool, _ModelState* localData, RANDGEN* rgs)
int restoreFromTrainingError(const exception::TrainingError& e, ThreadPool& pool, _ModelState* localData, RandGen* rgs)
{
std::cerr << "Failed to sample! Reset prior and retry!" << std::endl;
const size_t chStride = std::min(pool.getNumWorkers() * 8, this->docs.size());
@@ -134,7 +134,7 @@
return 0;
}

void optimizeParameters(ThreadPool& pool, _ModelState* localData, RANDGEN* rgs)
void optimizeParameters(ThreadPool& pool, _ModelState* localData, RandGen* rgs)
{
std::vector<std::future<void>> res;
topicPrior = math::MultiNormalDistribution<FLOAT>::estimate([this](size_t i)
@@ -164,7 +164,7 @@
}
pbeta.array() -= last;
ll += topicPrior.getLL(pbeta.head(this->K));
ll += math::lgammaT(doc.template getSumWordWeight<_TW>() + alpha * K + 1);
ll += math::lgammaT(doc.getSumWordWeight() + alpha * K + 1);
}
return ll;
}
@@ -197,7 +197,7 @@
DEFINE_SERIALIZER_AFTER_BASE(BaseClass, numBetaSample, numTMNSample, topicPrior);

public:
CTModel(size_t _K = 1, FLOAT smoothingAlpha = 0.1, FLOAT _eta = 0.01, const RANDGEN& _rg = RANDGEN{ std::random_device{}() })
CTModel(size_t _K = 1, FLOAT smoothingAlpha = 0.1, FLOAT _eta = 0.01, const RandGen& _rg = RandGen{ std::random_device{}() })
: BaseClass(_K, smoothingAlpha, _eta, _rg)
{
this->optimInterval = 2;
@@ -206,11 +206,8 @@
std::vector<FLOAT> getTopicsByDoc(const _DocType& doc) const
{
std::vector<FLOAT> ret(this->K);
FLOAT sum = doc.template getSumWordWeight<_TW>();
for (size_t k = 0; k < this->K; ++k)
{
ret[k] = doc.numByTopic[k] / sum;
}
Eigen::Map<Eigen::Matrix<FLOAT, -1, 1>>{ret.data(), this->K}.array() =
doc.numByTopic.array().template cast<FLOAT>() / doc.getSumWordWeight();
return ret;
}

10 changes: 5 additions & 5 deletions src/TopicModel/DMR.h
@@ -3,13 +3,13 @@

namespace tomoto
{
template<TermWeight _TW, bool _Shared = false>
struct DocumentDMR : public DocumentLDA<_TW, _Shared>
template<TermWeight _TW, size_t _Flags = 0>
struct DocumentDMR : public DocumentLDA<_TW, _Flags>
{
using DocumentLDA<_TW, _Shared>::DocumentLDA;
using DocumentLDA<_TW, _Flags>::DocumentLDA;
size_t metadata = 0;

DEFINE_SERIALIZER_AFTER_BASE2(DocumentLDA<_TW, _Shared>, metadata);
DEFINE_SERIALIZER_AFTER_BASE2(DocumentLDA<_TW, _Flags>, metadata);
};

class IDMRModel : public ILDAModel
@@ -18,7 +18,7 @@ namespace tomoto
using DefaultDocType = DocumentDMR<TermWeight::one>;
static IDMRModel* create(TermWeight _weight, size_t _K = 1,
FLOAT defaultAlpha = 1.0, FLOAT _sigma = 1.0, FLOAT _eta = 0.01, FLOAT _alphaEps = 1e-10,
const RANDGEN& _rg = RANDGEN{ std::random_device{}() });
const RandGen& _rg = RandGen{ std::random_device{}() });

virtual size_t addDoc(const std::vector<std::string>& words, const std::vector<std::string>& metadata) = 0;
virtual std::unique_ptr<DocumentBase> makeDoc(const std::vector<std::string>& words, const std::vector<std::string>& metadata) const = 0;
2 changes: 1 addition & 1 deletion src/TopicModel/DMRModel.cpp
@@ -6,7 +6,7 @@ namespace tomoto
template class DMRModel<TermWeight::idf>;
template class DMRModel<TermWeight::pmi>;

IDMRModel* IDMRModel::create(TermWeight _weight, size_t _K, FLOAT _defaultAlpha, FLOAT _sigma, FLOAT _eta, FLOAT _alphaEps, const RANDGEN& _rg)
IDMRModel* IDMRModel::create(TermWeight _weight, size_t _K, FLOAT _defaultAlpha, FLOAT _sigma, FLOAT _eta, FLOAT _alphaEps, const RandGen& _rg)
{
SWITCH_TW(_weight, DMRModel, _K, _defaultAlpha, _sigma, _eta, _alphaEps, _rg);
}
48 changes: 26 additions & 22 deletions src/TopicModel/DMRModel.hpp
@@ -16,19 +16,19 @@ namespace tomoto
Eigen::Matrix<FLOAT, -1, 1> tmpK;
};

template<TermWeight _TW, bool _Shared = false,
template<TermWeight _TW, size_t _Flags = 0,
typename _Interface = IDMRModel,
typename _Derived = void,
typename _DocType = DocumentDMR<_TW>,
typename _ModelState = ModelStateDMR<_TW>>
class DMRModel : public LDAModel<_TW, _Shared, _Interface,
typename std::conditional<std::is_same<_Derived, void>::value, DMRModel<_TW, _Shared>, _Derived>::type,
class DMRModel : public LDAModel<_TW, _Flags, _Interface,
typename std::conditional<std::is_same<_Derived, void>::value, DMRModel<_TW, _Flags>, _Derived>::type,
_DocType, _ModelState>
{
static constexpr const char* TMID = "DMR";
protected:
using DerivedClass = typename std::conditional<std::is_same<_Derived, void>::value, DMRModel<_TW>, _Derived>::type;
using BaseClass = LDAModel<_TW, _Shared, _Interface, DerivedClass, _DocType, _ModelState>;
using BaseClass = LDAModel<_TW, _Flags, _Interface, DerivedClass, _DocType, _ModelState>;
friend BaseClass;
friend typename BaseClass::BaseClass;
using WeightType = typename BaseClass::WeightType;
@@ -83,8 +83,8 @@ namespace tomoto
}
//val[K * F] = -(lgammaApprox(alphaDoc.array()) - lgammaApprox(doc.numByTopic.array().cast<FLOAT>() + alphaDoc.array())).sum();
//tmpK = -(digammaApprox(alphaDoc.array()) - digammaApprox(doc.numByTopic.array().cast<FLOAT>() + alphaDoc.array()));
val[K * F] += math::lgammaT(alphaSum) - math::lgammaT(doc.template getSumWordWeight<_TW>() + alphaSum);
FLOAT t = math::digammaT(alphaSum) - math::digammaT(doc.template getSumWordWeight<_TW>() + alphaSum);
val[K * F] += math::lgammaT(alphaSum) - math::lgammaT(doc.getSumWordWeight() + alphaSum);
FLOAT t = math::digammaT(alphaSum) - math::digammaT(doc.getSumWordWeight() + alphaSum);
if (!std::isfinite(alphaSum) && alphaSum > 0)
{
val[K * F] = -INFINITY;
@@ -116,7 +116,7 @@
}
}

void optimizeParameters(ThreadPool& pool, _ModelState* localData, RANDGEN* rgs)
void optimizeParameters(ThreadPool& pool, _ModelState* localData, RandGen* rgs)
{
Eigen::Matrix<FLOAT, -1, -1> bLambda;
FLOAT fx = 0, bestFx = INFINITY;
@@ -137,14 +137,21 @@
}
if (!std::isfinite(bestFx))
{
std::cout << "optimizing parameters has been failed!" << std::endl;
throw std::runtime_error{ "optimizing parameters has been failed!" };
throw exception::TrainingError{ "optimizing parameters has been failed!" };
}
lambda = bLambda;
//std::cerr << fx << std::endl;
expLambda = lambda.array().exp() + alphaEps;
}

int restoreFromTrainingError(const exception::TrainingError& e, ThreadPool& pool, _ModelState* localData, RandGen* rgs)
{
std::cerr << "Failed to optimize! Reset prior and retry!" << std::endl;
lambda.setZero();
expLambda = lambda.array().exp() + alphaEps;
return 0;
}

FLOAT* getZLikelihoods(_ModelState& ld, const _DocType& doc, size_t docId, size_t vid) const
{
const size_t V = this->realV;
@@ -173,7 +180,7 @@
ll += math::lgammaT(doc.numByTopic[k] + alphaDoc[k]);
ll -= math::lgammaT(alphaDoc[k]);
}
ll -= math::lgammaT(doc.template getSumWordWeight<_TW>() + alphaSum);
ll -= math::lgammaT(doc.getSumWordWeight() + alphaSum);
ll += math::lgammaT(alphaSum);
return ll;
}
@@ -194,7 +201,7 @@
{
ll += math::lgammaT(doc.numByTopic[k] + alphaDoc[k]) - math::lgammaT(alphaDoc[k]);
}
ll -= math::lgammaT(doc.template getSumWordWeight<_TW>() + alphaSum) - math::lgammaT(alphaSum);
ll -= math::lgammaT(doc.getSumWordWeight() + alphaSum) - math::lgammaT(alphaSum);
}
return ll;
}
@@ -231,7 +238,7 @@
{
lambda = Eigen::Matrix<FLOAT, -1, -1>::Constant(this->K, F, log(this->alpha));
}
if (_Shared) this->numByTopicDoc = Eigen::Matrix<WeightType, -1, -1>::Zero(this->K, this->docs.size());
if (_Flags & flags::continuous_doc_data) this->numByTopicDoc = Eigen::Matrix<WeightType, -1, -1>::Zero(this->K, this->docs.size());
expLambda = lambda.array().exp();
LBFGSpp::LBFGSParam<FLOAT> param;
param.max_iterations = maxBFGSIteration;
@@ -242,7 +249,7 @@

public:
DMRModel(size_t _K = 1, FLOAT defaultAlpha = 1.0, FLOAT _sigma = 1.0, FLOAT _eta = 0.01,
FLOAT _alphaEps = 0, const RANDGEN& _rg = RANDGEN{ std::random_device{}() })
FLOAT _alphaEps = 0, const RandGen& _rg = RandGen{ std::random_device{}() })
: BaseClass(_K, defaultAlpha, _eta, _rg), sigma(_sigma), alphaEps(_alphaEps)
{
}
@@ -285,7 +292,7 @@
{
std::vector<FLOAT> ret(this->K);
auto alphaDoc = expLambda.col(doc.metadata);
FLOAT sum = doc.template getSumWordWeight<_TW>() + alphaDoc.sum();
for (size_t k = 0; k < this->K; ++k)
{
ret[k] = (doc.numByTopic[k] + alphaDoc[k]) / sum;
}
Eigen::Map<Eigen::Matrix<FLOAT, -1, 1>>{ret.data(), this->K}.array() =
(doc.numByTopic.array().template cast<FLOAT>() + alphaDoc.array()) / (doc.getSumWordWeight() + alphaDoc.sum());
return ret;
}
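The rewritten `getTopicsByDoc` above replaces the per-topic loop with a single vectorized `Eigen::Map` expression, but the quantity computed is unchanged: each topic's count plus its document-specific alpha, divided by the common total. A plain-Python equivalent of that normalization (toy inputs, for illustration only):

```python
def topics_by_doc(num_by_topic, alpha_doc):
    # Smoothed topic proportions: (count_k + alpha_k) / (sum counts + sum alphas).
    total = sum(num_by_topic) + sum(alpha_doc)
    return [(n + a) / total for n, a in zip(num_by_topic, alpha_doc)]

dist = topics_by_doc([2, 0], [1.0, 1.0])
assert dist == [0.75, 0.25]
assert abs(sum(dist) - 1.0) < 1e-9
```

Writing through an `Eigen::Map` over `ret.data()` lets Eigen evaluate the whole expression in one pass over the output buffer instead of issuing K scalar divisions.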

@@ -311,11 +315,11 @@
};

/* This is for preventing 'undefined symbol' problem in compiling by clang. */
template<TermWeight _TW, bool _Shared,
template<TermWeight _TW, size_t _Flags,
typename _Interface, typename _Derived, typename _DocType, typename _ModelState>
constexpr FLOAT DMRModel<_TW, _Shared, _Interface, _Derived, _DocType, _ModelState>::maxLambda;
constexpr FLOAT DMRModel<_TW, _Flags, _Interface, _Derived, _DocType, _ModelState>::maxLambda;

template<TermWeight _TW, bool _Shared,
template<TermWeight _TW, size_t _Flags,
typename _Interface, typename _Derived, typename _DocType, typename _ModelState>
constexpr size_t DMRModel<_TW, _Shared, _Interface, _Derived, _DocType, _ModelState>::maxBFGSIteration;
constexpr size_t DMRModel<_TW, _Flags, _Interface, _Derived, _DocType, _ModelState>::maxBFGSIteration;
}
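With `optimizeParameters` now throwing `exception::TrainingError` instead of `std::runtime_error`, the newly added `restoreFromTrainingError` can reset `lambda` to a safe default and let training continue rather than aborting. A sketch of that catch-reset-retry control flow (all names here are illustrative, not the C++ API):

```python
class TrainingError(Exception):
    """Recoverable failure during parameter optimization."""

def train_step(state):
    if state['lambda'] is None:  # simulate a failed optimization
        raise TrainingError('optimizing parameters failed')
    return state

def train(state, max_retries=3):
    for _ in range(max_retries):
        try:
            return train_step(state)
        except TrainingError:
            # Reset the prior to a safe default and retry,
            # mirroring what restoreFromTrainingError does.
            state['lambda'] = 0.0
    raise RuntimeError('training failed after retries')

assert train({'lambda': None})['lambda'] == 0.0  # recovered and finished
```

The key design point is the exception type: only errors the model knows how to recover from are wrapped in `TrainingError`; anything else still propagates and stops training.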
10 changes: 5 additions & 5 deletions src/TopicModel/GDMR.h
@@ -3,13 +3,13 @@

namespace tomoto
{
template<TermWeight _TW, bool _Shared = false>
struct DocumentGDMR : public DocumentDMR<_TW, _Shared>
template<TermWeight _TW, size_t _Flags = 0>
struct DocumentGDMR : public DocumentDMR<_TW, _Flags>
{
using DocumentDMR<_TW, _Shared>::DocumentDMR;
using DocumentDMR<_TW, _Flags>::DocumentDMR;
std::vector<FLOAT> metadataC;

DEFINE_SERIALIZER_AFTER_BASE2(DocumentDMR<_TW, _Shared>, metadataC);
DEFINE_SERIALIZER_AFTER_BASE2(DocumentDMR<_TW, _Flags>, metadataC);
};

class IGDMRModel : public IDMRModel
@@ -18,7 +18,7 @@ namespace tomoto
using DefaultDocType = DocumentDMR<TermWeight::one>;
static IGDMRModel* create(TermWeight _weight, size_t _K = 1, const std::vector<size_t>& _degreeByF = {},
FLOAT defaultAlpha = 1.0, FLOAT _sigma = 1.0, FLOAT _eta = 0.01, FLOAT _alphaEps = 1e-10,
const RANDGEN& _rg = RANDGEN{ std::random_device{}() });
const RandGen& _rg = RandGen{ std::random_device{}() });

virtual FLOAT getSigma0() const = 0;
virtual void setSigma0(FLOAT) = 0;
