sub-encoder — package search across npm, PyPI, crates.io, RubyGems, Go, Maven & NuGet

BETAmodules.com is in beta — open to partnerships & joint ventures.Build with us

RubyGems matches

3 matches · Ruby

pikuri-vectordbv0.0.4

pikuri-vectordb gives a pikuri-core agent a +vectordb_search+ tool over a local document corpus — agentic search, the agent decides when to retrieve. Ships a swappable backend (a pure-Ruby +Backend::InMemory+ for teaching and a thin +Backend::Chroma+ HTTP client for persistence), a chunker, an embedder wrapper over +RubyLLM.embed+, and an optional +Reranker::LlamaServer+ that speaks +/v1/rerank+ against a cross-encoder model. Text extraction goes through +Pikuri::FileType.read_as_text+ in pikuri-core, which handles plain text / Markdown / PDF; HTML extraction is a deferred follow-up. Hosts wire the feature via +c.add_extension Pikuri::VectorDb::Extension.new(...)+ inside the +Agent.new+ block — same opt-in shape as +pikuri-tasks+ / +pikuri-skills+. The bundled +Pikuri::VectorDb::LIBRARIAN+ persona is the privilege-separated sub-agent counterpart for hosts that want recall to flow through a child rather than the parent's context. Three model endpoints in the full setup — chat (via ruby_llm), an embedder (via +RubyLLM.embed+), and an optional reranker (HTTP +/v1/rerank+). A single +llama-server+ in router mode serves all three by default, loading each cached GGUF on demand; see the gem's README for details.

Maintained. Niche but maintained, actively maintained.

icu4r_19v1.0

RubyGems

== ICU4R - ICU Unicode bindings for Ruby ICU4R is an attempt to provide better Unicode support for Ruby, where it lacks for a long time. Current code is mostly rewritten string.c from Ruby 1.8.3. ICU4R is Ruby C-extension binding for ICU library[1] and provides following classes and functionality: * UString: - String-like class with internal UTF16 storage; - UCA rules for UString comparisons (<=>, casecmp); - encoding(codepage) conversion; \ - Unicode normalization; - transliteration, also rule-based; Bunch of locale-sensitive functions: - upcase/downcase; - string collation; \ - string search; - iterators over text line/word/char/sentence breaks; \ - message formatting (number/currency/string/time); - date and number parsing. * URegexp - unicode regular expressions. * UResourceBundle - access to resource bundles, including ICU locale data. * UCalendar - date manipulation and timezone info. * UConverter - codepage conversions API * UCollator - locale-sensitive string comparison == Install and usage > ruby extconf.rb > make && make check > make install Now, in your scripts just require 'icu4r'. To create RDoc, run > sh tools/doc.sh == Requirements To build and use ICU4R you will need GCC and ICU v3.4 libraries[2]. == Differences from Ruby String and Regexp classes === UString vs String 1. UString substring/index methods use UTF16 codeunit indexes, not code points. 2. UString supports most methods from String class. Missing methods are: capitalize, capitalize!, swapcase, swapcase! %, center, ljust, rjust chomp, chomp!, chop, chop! \ count, delete, delete!, squeeze, squeeze!, tr, tr!, tr_s, tr_s! crypt, intern, sum, unpack dump, each_byte, each_line hex, oct, to_i, to_sym reverse, reverse! succ, succ!, next, next!, upto 3. Instead of String#% method, UString#format is provided. See FORMATTING for short reference. 4. UStrings can be created via String.to_u(encoding='utf8') or global u(str,[encoding='utf8']) calls. Note that +encoding+ parameter must be value of String class. 5. There's difference between character grapheme, codepoint and codeunit. See UNICODE reports for gory details, but in short: locale dependent notion of character can be presented using more than one codepoint - base letter and combining (accents) (also possible more than one!), and each codepoint can require more than one codeunit to store (for UTF8 codeunit size is 8bit, though \ some codepoints require up to 4bytes). So, UString has normalization and locale dependent break iterators. 6. Currently UString doesn't include Enumerable module. 7. UString index/[] methods which accept URegexp, throw exception if Regexp passed. 8. UString#<=>, UString#casecmp use UCA rules. === URegexp UString uses ICU regexp library. Pattern syntax is described in [./docs/UNICODE_REGEXPS] and ICU docs. There are some differences between processing in Ruby Regexp and URegexp: 1. When UString#sub, UString#gsub are called with block, special vars ($~, $&, $1, ...) aren't set, as their values are processed through deep ruby core code. Instead, block receives UMatch object, which is essentially immutable array of matching groups: "test".u.gsub(ure("(e)(.)")) do |match| \ puts match[0] # => 'es' <--> $& puts match[1] # => 'e' \ <--> $1 puts match[2] # => 's' <--> $2 end 2. In URegexp search pattern backreferences are in form \n (\1, \2, ...), in replacement string - in form $1, $2, ... NOTE: URegexp considers char to be a digit NOT ONLY ASCII (0x0030-0x0039), but any Unicode char, which has property Decimal digit number (Nd), e.g.: a = [?$, 0x1D7D9].pack("U*").u * 2 puts a.inspect_names <U000024>DOLLAR SIGN <U01D7D9>MATHEMATICAL DOUBLE-STRUCK DIGIT ONE <U000024>DOLLAR SIGN <U01D7D9>MATHEMATICAL DOUBLE-STRUCK DIGIT ONE puts "abracadabra".u.gsub(/(b)/.U, a) abbracadabbra \ 3. One can create URegexp using global Kernel#ure function, Regexp#U, Regexp#to_u, or from UString using URegexp.new, e.g: /pattern/.U =~ "string".u 4. There are differences about Regexp and URegexp multiline matching options: t = "text\ntest" # ^,$ handling : URegexp multiline <-> Ruby default t.u =~ ure('^\w+$', URegexp::MULTILINE) => #<UMatch:0xf6f7de04 @ranges=[0..3], @cg=[\u0074\u0065\u0078\u0074]> t =~ /^\w+$/ => 0 # . matches \n : URegexp DOTALL <-> /m t.u =~ ure('.+test', URegexp::DOTALL) \ => #<UMatch:0xf6fa4d88 ... t.u =~ /.+test/m 5. UMatch.range(idx) returns range for capturing group idx. This range is in codeunits. === References 1. ICU Official Homepage http://ibm.com/software/globalization/icu/ 2. ICU downloads \ http://ibm.com/software/globalization/icu/downloads.jsp 3. ICU Home Page http://icu.sf.net 4. Unicode Home Page http://www.unicode.org ==== BUGS, DOCS, TO DO The code is slow and inefficient yet, is still highly experimental, so can have many security and memory leaks, bugs, inconsistent documentation, incomplete test suite. Use it at your own risk. Bug reports and feature requests are welcome :) === Copying This extension module is copyrighted free software by Nikolai Lugovoi. You can redistribute it and/or modify it under the terms of MIT License. Nikolai Lugovoi <meadow.nnick@gmail.com>

Abandoned. Last published 14 years ago.

uv1.0.2

RubyGems

U U extends Ruby’s Unicode support. It provides a string class called U::String with an interface that mimics that of the String class in Ruby 2.0, but that can also be used from both Ruby 1.8. This interface also has more complete Unicode support and never modifies the receiver. Thus, a U::String is an immutable value object. U comes with complete and very accurate documentation¹. The documentation can realistically also be used as a reference to the Ruby String API and may actually be preferable, as it’s a lot more explicit and complete than the documentation that comes with Ruby. ¹ See http://disu.se/software/u-1.0/api/ § Installation Install u with % gem install u § Usage Usage is basically the following: require 'u-1.0' a = 'äbc' a.upcase # ⇒ 'äBC' a.u.upcase # ⇒ 'ÄBC' That is, you require the library, then you invoke #u on a String. This’ll give you a U::String that has much better Unicode support than a normal String. It’s important to note that U only uses UTF-8, which means that #u will try to #encode the String as such. This shouldn’t be an issue in most cases, as UTF-8 is now more or less the universal encoding – and rightfully so. As U::Strings¹ are immutable value objects, there’s also a U::Buffer² available for building U::Strings efficiently. See the API³ for more complete usage information. The following sections will only cover the extensions and differences that U::String exhibit from Ruby’s built-in String class. ¹ See http://disu.se/software/u-1.0/api/U/String/ ² See http://disu.se/software/u-1.0/api/U/Buffer/ ³ See http://disu.se/software/u-1.0/api/ § Unicode Properties There are quite a few property-checking interrogators that let you check if all characters in a U::String have the given Unicode property: • #alnum?¹ • #alpha?² • #assigned?³ • #case_ignorable?⁴ • #cased?⁵ • #cntrl?⁶ • #defined?⁷ • #digit?⁸ • #graph?⁹ • #newline?¹⁰ • #print?¹¹ • #punct?¹² • #soft_dotted?¹³ • #space?¹⁴ • #title?¹⁵ • #valid?¹⁶ • #wide?¹⁷ • #wide_cjk?¹⁸ • #xdigit?¹⁹ • #zero_width?²⁰ ¹ See http://disu.se/software/u-1.0/api/U/String/#alnum-p-instance-method ² See http://disu.se/software/u-1.0/api/U/String/#alpha-p-instance-method ³ See http://disu.se/software/u-1.0/api/U/String/#assigned-p-instance-method ⁴ See http://disu.se/software/u-1.0/api/U/String/#case_ignorable-p-instance-method ⁵ See http://disu.se/software/u-1.0/api/U/String/#cased-p-instance-method ⁶ See http://disu.se/software/u-1.0/api/U/String/#cntrl-p-instance-method ⁷ See http://disu.se/software/u-1.0/api/U/String/#defined-p-instance-method ⁸ See http://disu.se/software/u-1.0/api/U/String/#digit-p-instance-method ⁹ See http://disu.se/software/u-1.0/api/U/String/#graph-p-instance-method ¹⁰ See http://disu.se/software/u-1.0/api/U/String/#newline-p-instance-method ¹¹ See http://disu.se/software/u-1.0/api/U/String/#print-p-instance-method ¹² See http://disu.se/software/u-1.0/api/U/String/#punct-p-instance-method ¹³ See http://disu.se/software/u-1.0/api/U/String/#soft_dotted-p-instance-method ¹⁴ See http://disu.se/software/u-1.0/api/U/String/#space-p-instance-method ¹⁵ See http://disu.se/software/u-1.0/api/U/String/#title-p-instance-method ¹⁶ See http://disu.se/software/u-1.0/api/U/String/#valid-p-instance-method ¹⁷ See http://disu.se/software/u-1.0/api/U/String/#wide-p-instance-method ¹⁸ See http://disu.se/software/u-1.0/api/U/String/#wide_cjk-p-instance-method ¹⁹ See http://disu.se/software/u-1.0/api/U/String/#xdigit-p-instance-method ²⁰ See http://disu.se/software/u-1.0/api/U/String/#zero_width-p-instance-method Similar to these methods are • #folded?¹ • #lower?² • #upper?³ which check whether a ‹U::String› has been cased in a given manner. ¹ See http://disu.se/software/u-1.0/api/U/String/#folded-p-instance-method ² See http://disu.se/software/u-1.0/api/U/String/#lower-p-instance-method ³ See http://disu.se/software/u-1.0/api/U/String/#upper-p-instance-method There’s also a #normalized?¹ method that checks whether a ‹U::String› has been normalized on a given form. ¹ See http://disu.se/software/u-1.0/api/U/String/#normalized-p-instance-method You can also access certain Unicode properties of the characters of a U::String: • #canonical_combining_class¹ • #general_category² • #grapheme_break³ • #line_break⁴ • #script⁵ • #word_break⁶ ¹ See http://disu.se/software/u-1.0/api/U/String/#canonical_combining_class-instance-method ² See http://disu.se/software/u-1.0/api/U/String/#general_category-instance-method ³ See http://disu.se/software/u-1.0/api/U/String/#grapheme_break-instance-method ⁴ See http://disu.se/software/u-1.0/api/U/String/#line_break-instance-method ⁵ See http://disu.se/software/u-1.0/api/U/String/#script-instance-method ⁶ See http://disu.se/software/u-1.0/api/U/String/#word_break-instance-method § Locale-specific Comparisons Comparisons of U::Strings respect the current locale (and also allow you to specify a locale to use): ‹#<=>›¹, #casecmp², and #collation_key³. ¹ See http://disu.se/software/u-1.0/api/U/String/#comparison-operator ² See http://disu.se/software/u-1.0/api/U/String/#casecmp-instance-method ³ See http://disu.se/software/u-1.0/api/U/String/#collation_key-instance-method § Additional Enumerators There are a couple of additional enumerators in #each_grapheme_cluster¹ and #each_word² (along with aliases #grapheme_clusters³ and #words⁴). ¹ See http://disu.se/software/u-1.0/api/U/String/#each_grapheme_cluster-instance-method ² See http://disu.se/software/u-1.0/api/U/String/#each_word-instance-method ³ See http://disu.se/software/u-1.0/api/U/String/#grapheme_clusters-instance-method ⁴ See http://disu.se/software/u-1.0/api/U/String/#words-instance-method § Unicode-aware Sub-sequence Removal #Chomp¹, #chop², #lstrip³, #rstrip⁴, and #strip⁵ all look for Unicode newline and space characters, rather than only ASCII ones. ¹ See http://disu.se/software/u-1.0/api/U/String/#chomp-instance-method ² See http://disu.se/software/u-1.0/api/U/String/#chop-instance-method ³ See http://disu.se/software/u-1.0/api/U/String/#lstrip-instance-method ⁴ See http://disu.se/software/u-1.0/api/U/String/#rstrip-instance-method ⁵ See http://disu.se/software/u-1.0/api/U/String/#strip-instance-method § Unicode-aware Conversions Case-shifting methods #downcase¹ and #upcase² do proper Unicode casing and the interface is further augmented by #foldcase³ and #titlecase⁴. #Mirror⁵ and #normalize⁶ do conversions similar in nature to the case-shifting methods. ¹ See http://disu.se/software/u-1.0/api/U/String/#downcase-instance-method ² See http://disu.se/software/u-1.0/api/U/String/#upcase-instance-method ³ See http://disu.se/software/u-1.0/api/U/String/#foldcase-instance-method ⁴ See http://disu.se/software/u-1.0/api/U/String/#titlecase-instance-method ⁵ See http://disu.se/software/u-1.0/api/U/String/#mirror-instance-method ⁶ See http://disu.se/software/u-1.0/api/U/String/#normalize-instance-method § Width Calculations #Width¹ will return the number of cells on a terminal that a U::String will occupy. #Center², #ljust³, and #rjust⁴ deal in width rather than length, making them much more useful for generating terminal output. #%⁵ (and its alias #format⁶) similarly deal in width. ¹ See http://disu.se/software/u-1.0/api/U/String/#width-instance-method ² See http://disu.se/software/u-1.0/api/U/String/#center-instance-method ³ See http://disu.se/software/u-1.0/api/U/String/#ljust-instance-method ⁴ See http://disu.se/software/u-1.0/api/U/String/#rjust-instance-method ⁵ See http://disu.se/software/u-1.0/api/U/String/#modulo-operator ⁶ See http://disu.se/software/u-1.0/api/U/String/#format-instance-method § Extended Type Conversions Finally, #hex¹, #oct², and #to_i³ use Unicode alpha-numerics for their respective conversions. ¹ See http://disu.se/software/u-1.0/api/U/String/#hex-instance-method ² See http://disu.se/software/u-1.0/api/U/String/#oct-instance-method ³ See http://disu.se/software/u-1.0/api/U/String/#to_i-instance-method § News § 1.0.0 Initial public release! § Financing Currently, most of my time is spent at my day job and in my rather busy private life. Please motivate me to spend time on this piece of software by donating some of your money to this project. Yeah, I realize that requesting money to develop software is a bit, well, capitalistic of me. But please realize that I live in a capitalistic society and I need money to have other people give me the things that I need to continue living under the rules of said society. So, if you feel that this piece of software has helped you out enough to warrant a reward, please PayPal a donation to now@disu.se¹. Thanks! Your support won’t go unnoticed! ¹ Send a donation: https://www.paypal.com/cgi-bin/webscr?cmd=_donations&business=now@disu.se&item_name=U § Reporting Bugs Please report any bugs that you encounter to the {issue tracker}¹. ¹ See https://github.com/now/u/issues § Authors Nikolai Weibull wrote the code, the tests, the documentation, and this README. § Licensing U is free software: you may redistribute it and/or modify it under the terms of the {GNU Lesser General Public License, version 3}¹ or later², as published by the {Free Software Foundation}³. ¹ See http://disu.se/licenses/lgpl-3.0/ ² See http://gnu.org/licenses/ ³ See http://fsf.org/

Abandoned. Last published 10 years ago.