Some classes to represent elements in a text corpus.
Text corpus calculation in JavaScript.
Eleventy plugin word count, which calculates the number of words in a text corpus.
Georgian parallel text corpus - Essential 218 and GSFA 1200 sentences with Georgian translations
French parallel text corpus - Essential 218 and GSFA 1200 sentences with French translations
Malayalam parallel text corpus - Essential 218 and GSFA 1200 sentences with Malayalam translations
Japanese parallel text corpus - Essential 218 and GSFA 1200 sentences with Japanese translations
Khmer parallel text corpus - Essential 218 and GSFA 1200 sentences with Khmer translations
Vietnamese parallel text corpus - Essential 218 and GSFA 1200 sentences with Vietnamese translations
Thai parallel text corpus - Essential 218 and GSFA 1200 sentences with Thai translations
Chinese parallel text corpus - Essential 218 and GSFA 1200 sentences with Chinese translations
Indonesian parallel text corpus - Essential 218 and GSFA 1200 sentences with Indonesian translations
Malay parallel text corpus - Essential 218 and GSFA 1200 sentences with Malay translations
Hindi parallel text corpus - Essential 218 and GSFA 1200 sentences with Hindi translations
Korean parallel text corpus - Essential 218 and GSFA 1200 sentences with Korean translations
Spanish parallel text corpus - Essential 218 and GSFA 1200 sentences with Spanish translations
Burmese parallel text corpus - Essential 218 and GSFA 1200 sentences with Burmese translations
Lao parallel text corpus - Essential 218 and GSFA 1200 sentences with Lao translations
German parallel text corpus - Essential 218 sentences with German translations
A corpus of schematic layouts made with [tscircuit](https://github.com/tscircuit/tscircuit).
Neural Network
TypeScript port of tcvdb_text — BM25 sparse vector encoder for Tencent Cloud VectorDB
Almost complete English verb list.
AdiaUI A2UI training corpus — canonical v0.9 catalog + chunks + eval fixtures + feedback + gap registry. Consumed by the compose engine's retrieval layer + the MCP pipeline.
Term Frequency - Inverse Document Frequency
An implementation of the LexRank algorithm, which summarizes a corpus of text documents.
A Ruby port of Perl Lingua::EN::Tagger, a probability-based, corpus-trained tagger that assigns POS tags to English text based on a lookup dictionary and a set of probability values.
Ruby port of TinySegmenter.js for tokenizing Japanese text. Uses a Naive Bayes model that has been trained using the RWCP corpus and optimized using L1-norm regularization. The resultant model is quite compact, yet has a 95% accuracy rate.
Maxixe is an implementation of the Tango algorithm described in the paper "Mostly-unsupervised statistical segmentation of Japanese kanji sequences" by Ando and Lee. While the paper deals with Japanese characters, it should work on any unsegmented text given enough corpus data and tuning of the algorithm parameters.
In information retrieval, tf–idf or TFIDF, short for term frequency–inverse document frequency, is a numerical statistic that is intended to reflect how important a word is to a document in a collection or corpus. It is often used as a weighting factor in searches of information retrieval, text mining, and user modeling. The tf–idf value increases proportionally to the number of times a word appears in the document and is offset by the number of documents in the corpus that contain the word, which helps to adjust for the fact that some words appear more frequently in general.
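The weighting described above can be illustrated with a minimal sketch. It assumes whitespace tokenization, lowercase folding, a raw-count term frequency, and idf = log(N / df); these are choices made for this example only, and real tf–idf packages often use smoothed or normalized variants. The function name `tf_idf` and the sample corpus are hypothetical.

```python
import math
from collections import Counter


def tf_idf(corpus):
    """Return one {term: weight} dict per document in `corpus`.

    Assumptions made for this sketch (not taken from any package above):
    whitespace tokenization, lowercase folding, raw-count tf,
    and idf = log(N / df) with no smoothing.
    """
    docs = [doc.lower().split() for doc in corpus]
    n_docs = len(docs)

    # Document frequency: in how many documents each term appears.
    df = Counter(term for tokens in docs for term in set(tokens))

    weights = []
    for tokens in docs:
        tf = Counter(tokens)  # raw count of each term in this document
        weights.append(
            {term: count * math.log(n_docs / df[term]) for term, count in tf.items()}
        )
    return weights


# Hypothetical three-document corpus.
corpus = ["the cat sat", "the dog sat", "the cat ran"]
scores = tf_idf(corpus)
```

In this toy corpus, "the" appears in every document, so its idf is log(3/3) = 0 and its weight vanishes, while "dog" appears in only one document and receives the largest weight in the second document.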
pikuri-vectordb gives a pikuri-core agent a +vectordb_search+ tool over a local document corpus — agentic search, the agent decides when to retrieve. Ships a swappable backend (a pure-Ruby +Backend::InMemory+ for teaching, plus thin +Backend::Qdrant+ / +Backend::Chroma+ HTTP clients for persistence — Qdrant recommended), a chunker, an embedder wrapper over +RubyLLM.embed+, and an optional +Reranker::LlamaServer+ that speaks +/v1/rerank+ against a cross-encoder model. Text extraction goes through +Pikuri::FileType.read_as_text+ in pikuri-core, which handles plain text / Markdown / PDF; HTML extraction is a deferred follow-up. Hosts wire the feature via +c.add_extension Pikuri::VectorDb::Extension.new(...)+ inside the +Agent.new+ block — same opt-in shape as +pikuri-tasks+ / +pikuri-skills+. The bundled +Pikuri::VectorDb::LIBRARIAN+ persona is the privilege-separated sub-agent counterpart for hosts that want recall to flow through a child rather than the parent's context. Three model endpoints in the full setup — chat (via ruby_llm), an embedder (via +RubyLLM.embed+), and an optional reranker (HTTP +/v1/rerank+). A single +llama-server+ in router mode serves all three by default, loading each cached GGUF on demand; see the gem's README for details.
== DESCRIPTION

MMS2R is a library that decodes the parts of an MMS message to disk while stripping out advertising injected by the mobile carriers. MMS messages are multipart email, and the carriers often inject branding into these messages. Use MMS2R if you want to get at the real user-generated content from an MMS without having to deal with the cruft from the carriers.

If MMS2R is not aware of a particular carrier, no extra processing is done to the MMS other than decoding and consolidating its media.

Contact the author to add additional carriers to be processed by the library. Suggestions and patches appreciated and welcomed!

Corpus of carriers currently processed by MMS2R:

* 1nbox/Idea: 1nbox.net
* 3 Ireland: mms.3ireland.ie
* Alltel: mms.alltel.com
* AT&T/Cingular/Legacy: mms.att.net, txt.att.net, mmode.com, mms.mycingular.com, cingularme.com, mobile.mycingular.com, pics.cingularme.com
* Bell Canada: txt.bell.ca
* Bell South / Suncom: bellsouth.net
* Cricket Wireless: mms.mycricket.com
* Dobson/Cellular One: mms.dobson.net
* Helio: mms.myhelio.com
* Hutchison 3G UK Ltd: mms.three.co.uk
* INDOSAT M2: mobile.indosat.net.id
* LUXGSM S.A.: mms.luxgsm.lu
* Maroc Telecom: mms.mobileiam.ma
* MTN South Africa: mms.mtn.co.za
* NetCom (Norway): mms.netcom.no
* Nextel: messaging.nextel.com
* O2 Germany: mms.o2online.de
* O2 UK: mediamessaging.o2.co.uk
* Orange & Regional Oranges: orangemms.net, mmsemail.orange.pl, orange.fr
* PLSPICTURES.COM mms hosting: waw.plspictures.com
* PXT New Zealand: pxt.vodafone.net.nz
* Rogers of Canada: rci.rogers.com
* SaskTel: sms.sasktel.com
* Sprint: pm.sprint.com, messaging.sprintpcs.com, sprintpcs.com
* T-Mobile: tmomail.net, mmsreply.t-mobile.co.uk, tmo.blackberry.net
* TELUS Corporation (Canada): mms.telusmobility.com, msg.telus.com
* UAE MMS: mms.ae
* Unicel: unicel.com, info2go.com (note: mobile number is tucked away in a text/plain part for unicel.com)
* Verizon: vzwpix.com, vtext.com
* Virgin Mobile: vmpix.com
* Virgin Mobile of Canada: vmobile.ca
* Vodacom: mms.vodacom4me.co.za
No description provided.