respace
1.0
respacer is part of a larger project of mine that requires reconstructing sentences that have lost their whitespace. For example, given the input "itisiyourking", we would like to produce "it is i your king", the most likely original sentence. This mini-project serves to capture the work done for that purpose.
To achieve this goal, I have drawn from recent experience working with natural language processing tools.
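To make the problem concrete, here is a minimal sketch of one way to enumerate candidate sentences: try every split of the input into words found in a word list. This is an illustration only and not necessarily how respacer works internally; the function name `segmentations` and the toy word list `WORDS` are hypothetical.

```python
def segmentations(text, words):
    """Yield every way to split `text` into a sequence of known words."""
    if not text:
        # Nothing left to split: one valid (empty) continuation.
        yield []
        return
    for end in range(1, len(text) + 1):
        head = text[:end]
        if head in words:
            # Keep this word and recurse on the remainder of the input.
            for rest in segmentations(text[end:], words):
                yield [head] + rest


# Hypothetical toy word list; a real one would hold a whole dictionary.
WORDS = {"i", "it", "is", "tis", "you", "your", "king"}

candidates = [" ".join(s) for s in segmentations("itisiyourking", WORDS)]
print(candidates)
# ['i tis i your king', 'it is i your king']
```

Even this tiny word list yields two grammatical-looking candidates, which is why some way of ranking them is needed on top of the dictionary lookup.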
I dare claim that its performance is up there, but I have little to compare it against. It is fast enough in the context of the aforementioned larger project, so I will most probably not spend more time optimizing it. Note also that I have only ever run this project on macOS, but the code is cross-platform and so are its dependencies.
respacer depends on libkenlm, which provides the language model analysis facilities. libkenlm itself depends on libboost-system, libboost-thread, libz and libbz2.
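As a rough illustration of what a language model contributes, the sketch below ranks two candidate sentences with a toy bigram model using add-one smoothing. It is a stand-in written for this example, not libkenlm's API or its actual smoothing method; the tiny `CORPUS` and the `log_prob` helper are assumptions made up for the illustration.

```python
import math
from collections import Counter

# Hypothetical miniature training text; a real model is built from a large corpus.
CORPUS = "it is i your king . it is the east . is it your king".split()

unigrams = Counter(CORPUS)
bigrams = Counter(zip(CORPUS, CORPUS[1:]))
VOCAB = len(unigrams) + 1  # one extra slot reserved for unseen words


def log_prob(sentence):
    """Sum of add-one-smoothed bigram log-probabilities over a sentence."""
    tokens = sentence.split()
    total = 0.0
    for prev, word in zip(tokens, tokens[1:]):
        # Counter returns 0 for pairs or words never seen in the corpus.
        total += math.log((bigrams[(prev, word)] + 1) / (unigrams[prev] + VOCAB))
    return total


candidates = ["i tis i your king", "it is i your king"]
best = max(candidates, key=log_prob)
print(best)
# it is i your king
```

The candidate whose word pairs were seen more often in the training text gets the higher score, which is the intuition behind choosing "the most likely" sentence.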
In order to use respacer, one must supply two files at run-time: a word list and a language model.
This sample code produces an executable that reads a string from standard input and writes the sentence, with its spaces restored, to standard output. It uses two included files:
aspell_en_expanded, a word list (generated with: aspell -d en dump master | aspell -l en expand > aspell_en_expanded).
romeo_and_juliet.mmap, a language model (generated with: bin/lmplz -o 3 < pg1112.txt > romeo_and_juliet.arpa; bin/build_binary romeo_and_juliet.arpa romeo_and_juliet.mmap).

(C) Copyright Thierry Seegers 2015. Distributed under the following license:

Boost Software License - Version 1.0 - August 17th, 2003

Permission is hereby granted, free of charge, to any person or organization obtaining a copy of the software and accompanying documentation covered by this license (the "Software") to use, reproduce, display, distribute, execute, and transmit the Software, and to prepare derivative works of the Software, and to permit third-parties to whom the Software is furnished to do so, all subject to the following:

The copyright notices in the Software and this entire statement, including the above license grant, this restriction and the following disclaimer, must be included in all copies of the Software, in whole or in part, and all derivative works of the Software, unless such copies or derivative works are solely in the form of machine-executable object code generated by a source language processor.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE, TITLE AND NON-INFRINGEMENT. IN NO EVENT SHALL THE COPYRIGHT HOLDERS OR ANYONE DISTRIBUTING THE SOFTWARE BE LIABLE FOR ANY DAMAGES OR OTHER LIABILITY, WHETHER IN CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.