BurningBright


Match characters

Posted on 2017-02-05 | Edited on 2020-09-17 | In regex

Match a character

Hexadecimal character

[a-fA-F0-9]

  • Regex options: None
  • Regex flavors: .NET, Java, JavaScript, PCRE, Perl, Python, Ruby

Nonhexadecimal character

[^a-fA-F0-9]

  • Regex options: None
  • Regex flavors: .NET, Java, JavaScript, PCRE, Perl, Python, Ruby

The notation using square brackets is called a character class.

  • ‹\›, ‹^›, ‹-›, and ‹]› have special functions inside square brackets

  • JavaScript treats ‹[]› as an empty character class that always fails to match.

  • In all the regex flavors discussed in this book, a negated character class matches line break characters, unless you add them to the negated character class. Make sure that you don’t accidentally allow your regex to span across lines.
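
A quick Python sketch (Python being one of the listed flavors) illustrating both classes and the line-break caveat; the sample strings are invented:

import re

hex_char = re.compile(r'[a-fA-F0-9]')
non_hex = re.compile(r'[^a-fA-F0-9]')

print(hex_char.findall('0xBeef'))      # ['0', 'B', 'e', 'e', 'f'] ('x' is not a hex digit)
print(bool(non_hex.search('cafe')))    # False: every character is a hex digit
print(bool(non_hex.search('ca\nfe')))  # True: the negated class matches the newline too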


About \w

  • Java 4-6, JavaScript, PCRE, Ruby: \w -> [a-zA-Z0-9_]
  • Java 7: with the Pattern.UNICODE_CHARACTER_CLASS flag set, \w matches Unicode word characters; ‹(?U)› turns this on inside the regex.
  • Python 2.x: with the UNICODE or U flag set, \w matches Unicode word characters.
  • Python 3.x: \w matches Unicode word characters by default; the ASCII or A flag makes it ASCII-only.
  • Perl >= 5.14: /u (Unicode) adds all Unicode scripts, /l (locale) makes \w depend on the locale, and /d (the default, when none of /adlu is given) follows the same Unicode-script rules as Perl < 5.14.
  • Perl < 5.14: \w automatically includes Unicode scripts if the subject string or the regex is encoded as UTF-8, or if the regex includes a code point above 255 such as ‹\x{100}› or a Unicode property such as ‹\p{L}›. Otherwise \w defaults to pure ASCII.
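
A short Python 3 sketch of the flag behavior described above (the sample string is arbitrary):

import re

s = 'naïve_42'
print(re.findall(r'\w', s))            # Unicode by default in Python 3: 'ï' is included
print(re.findall(r'(?a)\w', s))        # inline ASCII flag: 'ï' is no longer a word character
print(re.findall(r'\w', s, re.ASCII))  # same effect via the ASCII/A flag argument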

About \d

  • \d follows the same rules as \w in all these flavors.
  • In .NET, digits from other scripts are always included.
  • In Python it depends on the UNICODE and ASCII flags, and on whether you’re using Python 2.x or 3.x. In Perl 5.14, it depends on the /adlu flags.
  • In earlier versions of Perl, it depends on the encoding of the subject and regex, and on whether the regex has any Unicode tokens.

About \s

  • \s matches any whitespace character. This includes spaces, tabs, and line breaks.
  • \S matches any character not matched by \s
  • In .NET and JavaScript, \s also matches any character defined as whitespace by the Unicode standard.
  • In Java, Perl, and Python, \s follows the same rules as \w and \d.
  • Notice that JavaScript uses Unicode for ‹\s› but ASCII for \d and \w.

Flavor-Specific Features

[a-zA-Z0-9-[g-zG-Z]]
  • uses character class subtraction to match a single hexadecimal character.
  • Regex options: None
  • Regex flavors: .NET 2.0 or later

[\w&&[a-fA-F0-9\s]]
  • uses character class intersection to match a hexadecimal digit.
  • Regex options: None
  • Regex flavors: Java
[a-zA-Z0-9&&[^g-zG-Z]]
  • uses character class subtraction to match a single hexadecimal character in a roundabout way.
  • Regex options: None
  • Regex flavors: Java
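
Python has neither class intersection nor subtraction, but a lookahead can emulate both; a hedged sketch of that common workaround:

import re

# Intersection [\w&&[a-fA-F0-9\s]] emulated: the lookahead restricts the position
# to the second class, then \w consumes the character.
hex1 = re.compile(r'(?=[a-fA-F0-9\s])\w')

# Subtraction [a-zA-Z0-9&&[^g-zG-Z]] emulated with a negative lookahead.
hex2 = re.compile(r'(?![g-zG-Z])[a-zA-Z0-9]')

print(hex1.findall('x = 0xFF'))   # ['0', 'F', 'F']
print(hex2.findall('x = 0xFF'))   # ['0', 'F', 'F']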

Match any character

Any character except line breaks

‹.›

  • Regex options: None (the “dot matches line breaks” option must not be set)
  • Regex flavors: .NET, Java, JavaScript, PCRE, Perl, Python, Ruby

Any character including line breaks

‹.›

  • Regex options: Dot matches line breaks
  • Regex flavors: .NET, Java, PCRE, Perl, Python, Ruby

‹[\s\S]›

  • Regex options: None
  • Regex flavors: .NET, Java, JavaScript, PCRE, Perl, Python, Ruby

Inline mode modifier

‹(?s).›

  • Regex options: None
  • Regex flavors: .NET, Java, XRegExp, PCRE, Perl, Python

‹(?m).›

  • Regex options: None
  • Regex flavors: Ruby
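
In Python the same options look like this (sample text invented):

import re

text = 'first\nsecond'
print(len(re.findall(r'.', text)))            # 11: the newline is not matched
print(len(re.findall(r'(?s).', text)))        # 12: inline mode modifier includes it
print(len(re.findall(r'[\s\S]', text)))       # 12: flavor-neutral workaround
print(len(re.findall(r'.', text, re.DOTALL))) # 12: the external option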

ASCII

Posted on 2017-02-05 | Edited on 2020-09-17 | In tool

ASCII

0 1 2 3 4 5 6 7 8 9 A B C D E F
0 NUL SOH STX ETX EOT ENQ ACK BEL BS HT LF VT FF CR SO SI
1 DLE DC1 DC2 DC3 DC4 NAK SYN ETB CAN EM SUB ESC FS GS RS US
2 (space) ! " # $ % & ' ( ) * + , - . /
3 0 1 2 3 4 5 6 7 8 9 : ; < = > ?
4 @ A B C D E F G H I J K L M N O
5 P Q R S T U V W X Y Z [ \ ] ^ _
6 ` a b c d e f g h i j k l m n o
7 p q r s t u v w x y z { | } ~ DEL (delete)

Control characters

  • NUL (null)
  • SOH (start of heading)
  • STX (start of text)
  • ETX (end of text)
  • EOT (end of transmission)
  • ENQ (enquiry)
  • ACK (acknowledge)
  • BEL (bell)
  • BS (backspace)
  • HT (horizontal tab)
  • LF (NL line feed, new line)
  • VT (vertical tab)
  • FF (NP form feed, new page)
  • CR (carriage return)
  • SO (shift out)
  • SI (shift in)
  • DLE (data link escape)
  • DC1 (device control 1)
  • DC2 (device control 2)
  • DC3 (device control 3)
  • DC4 (device control 4)
  • NAK (negative acknowledge)
  • SYN (synchronous idle)
  • ETB (end of trans. block)
  • CAN (cancel)
  • EM (end of medium)
  • SUB (substitute)
  • ESC (escape)
  • FS (file separator)
  • GS (group separator)
  • RS (record separator)
  • US (unit separator)

Match nonprintable characters

Posted on 2017-02-05 | Edited on 2020-09-17 | In regex

Problem

Match these characters

  • bell
  • escape
  • form feed
  • line feed
  • carriage return
  • horizontal tab
  • vertical tab
    hexadecimal ASCII codes:
    07, 1B, 0C, 0A, 0D, 09, 0B

Solution

\a\e\f\n\r\t\v
  • Regex options: None
  • Regex flavors: .NET, Java, PCRE, Perl, Ruby
\x07\x1B\f\n\r\t\v
  • Regex options: None
  • Regex flavors: .NET, Java, JavaScript, Python, Ruby
\a\e\f\n\r\t\x0B
  • Regex options: None
  • Regex flavors: .NET, Java, PCRE, Perl, Ruby
Representation Meaning Hex Regex flavors
‹\a› bell 0x07 .NET, Java, PCRE, Perl, Python, Ruby
‹\e› escape 0x1B .NET, Java, PCRE, Perl, Ruby
‹\f› form feed 0x0C .NET, Java, JavaScript, PCRE, Perl, Python, Ruby
‹\n› line feed (newline) 0x0A .NET, Java, JavaScript, PCRE, Perl, Python, Ruby
‹\r› carriage return 0x0D .NET, Java, JavaScript, PCRE, Perl, Python, Ruby
‹\t› horizontal tab 0x09 .NET, Java, JavaScript, PCRE, Perl, Python, Ruby
‹\v› vertical tab 0x0B .NET, Java, JavaScript, Python, Ruby, Perl 5.10, PCRE 7.2

Variations

\cG\x1B\cL\cJ\cM\cI\cK
  • Regex options: None
  • Regex flavors: .NET, Java, JavaScript, PCRE, Perl, Ruby 1.9

The 7-bit character set

\x07\x1B\x0C\x0A\x0D\x09\x0B

  • Regex options: None
  • Regex flavors: .NET, Java, JavaScript, PCRE, Perl, Python, Ruby
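
A Python check of the table above; Python’s re has no ‹\e›, so escape is written ‹\x1B› here (the sample string is invented):

import re

nonprintable = re.compile(r'[\a\x1B\f\n\r\t\v]')

sample = 'col1\tcol2\r\nbeep\x07'
print([hex(ord(c)) for c in nonprintable.findall(sample)])
# ['0x9', '0xd', '0xa', '0x7']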

Match literal text

Posted on 2017-02-04 | Edited on 2020-09-17 | In regex

Problem

Match this literal text

The punctuation characters in the ASCII table are: !"#$%&'()*+,-./:;<=>?@[\]^_`{|}~

Solution

The●punctuation●characters●in●the●ASCII●table●are:●↵
!"#\$%&'\(\)\*\+,-\./:;<=>\?@\[\\]\^_`\{\|}~
  • Regex options: None
  • Regex flavors: .NET, Java, JavaScript, PCRE, Perl, Python, Ruby
$()*+.?[\^{|
  • these twelve characters are regex metacharacters; escape them to match them literally
  • ‹]› and ‹}› are not metacharacters and need no escaping

Variations

The●punctuation●characters●in●the●ASCII●table●are:●↵
\Q!"#$%&'()*+,-./:;<=>?@[\]^_`{|}~\E
  • Regex options: None
  • Regex flavors: Java 6, PCRE, Perl

Use ‹\Q…\E› to suppress the meaning of all metacharacters, including the backslash.
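
Python has no ‹\Q…\E›, but re.escape plays the same role; a minimal sketch:

import re

literal = 'The punctuation characters in the ASCII table are: !"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~'
pattern = re.compile(re.escape(literal))   # escapes every metacharacter for us
print(bool(pattern.search(literal)))       # True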


Case-insensitive matching

Outside the regex
ascii

  • Regex options: Case insensitive
  • Regex flavors: .NET, Java, JavaScript, PCRE, Perl, Python, Ruby

Inside the regex
(?i)ascii

  • Regex options: None
  • Regex flavors: .NET, Java, XRegExp, PCRE, Perl, Python, Ruby
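
Both spellings in Python:

import re

print(bool(re.search(r'(?i)ascii', 'US-ASCII')))             # inline mode modifier
print(bool(re.search(r'ascii', 'US-ASCII', re.IGNORECASE)))  # external option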

github style

Posted on 2017-02-03 | Edited on 2020-06-17 | In tool


Top to bottom, left to right

  • XXX’s GitHub repository : the “official” page
  • Your GitHub repository : your “personal” page
  • Working directory : the “battlefield”
  • Local repository : local project storage
  • Index : cache space
  1. "fork" : on the web, “official” -> “personal”
  2. "Pull request" : on the web, hand changes back to “official”
  3. push : local storage to remote storage
  4. fetch : remote storage to local storage
  5. clone : copy a remote git project
  6. pull XXX-github : get “official” updates into local storage and merge them into the “battlefield”
  7. merge : merge local storage into the “battlefield”
  8. pull my-github : get remote storage updates into local storage
  9. checkout : restore a file or hash version to the “battlefield” directly from local storage
  10. checkout : restore a file to the “battlefield”, overwriting it from the cache
  11. commit -a : commit to local storage and update the remote directly (push)
  12. commit : commit to local storage
  13. add : put additions/deletions/modifications into the cache
  14. reset : reset a file or hash version in the cache from local storage

lucene index file constructure (2)

Posted on 2017-02-02 | Edited on 2018-12-16 | In search

1. Prefix & Suffix

Lucene’s inverted index stores the Term Dictionary. All Terms are sorted in dictionary order, but the dictionary contains every word in the documents; some words are long, and the space they occupy would be astonishing. So in Lucene, when a word shares a prefix with the previous word, it stores the length of the shared prefix followed by only its suffix.

Naive storage:

[VInt = 4] [t][e][r][m]
[VInt = 10][t][e][r][m][a][g][a][n][c][y]
[VInt = 9][t][e][r][m][a][g][a][n][t]
[VInt = 8][t][e][r][m][i][n][a][l]

Compressed storage:

[VInt = 4] [t][e][r][m]
[VInt = 4 (offset)][VInt = 6][a][g][a][n][c][y]
[VInt = 8 (offset)][VInt = 1][t]
[VInt = 4(offset)][VInt = 4][i][n][a][l]

This reduces the occupied space; in dictionary order especially, adjacent words have a high probability of sharing a prefix.
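
A minimal Python sketch of this shared-prefix encoding (function name invented; Lucene’s real on-disk format is more involved):

def compress(terms):
    # Store each term as (shared-prefix length, suffix), as in the example above.
    out, prev = [], ''
    for term in terms:
        n = 0
        while n < min(len(prev), len(term)) and prev[n] == term[n]:
            n += 1
        out.append((n, term[n:]))
        prev = term
    return out

print(compress(['term', 'termagancy', 'termagant', 'terminal']))
# [(0, 'term'), (4, 'agancy'), (8, 't'), (4, 'inal')]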

2. Delta

Lucene’s inverted index needs to store many Integer numbers, such as IDs, positions, and frequencies. To reduce the space Integers occupy, an increasing sequence stores only each value’s difference from the previous one.

16386 16387 16388 16389
[(1) 000, 0010][(1) 000, 0000][(0) 000, 0001]
[(1) 000, 0011][(1) 000, 0000][(0) 000, 0001]
[(1) 000, 0100][(1) 000, 0000][(0) 000, 0001]
[(1) 000, 0101][(1) 000, 0000][(0) 000, 0001]
16386 16387 16388 16389
[(1) 000, 0010][(1) 000, 0000][(0) 000, 0001]
[(0) 000, 0001]
[(0) 000, 0001]
[(0) 000, 0001]
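
The same idea as a tiny Python sketch (helper invented for illustration):

def deltas(values):
    # Keep the first value, then only each increase over the previous value.
    return values[:1] + [b - a for a, b in zip(values, values[1:])]

print(deltas([16386, 16387, 16388, 16389]))  # [16386, 1, 1, 1]
# 16386 needs a 3-byte VInt; each delta of 1 fits in a single byte.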

3. Flag (A, B?)

The same situation recurs in the Lucene index structure: after a block there may or may not be another block. The naive approach adds a whole byte to flag this boolean, but a single bit is enough: shift the original data left one bit and keep the flag in the freed low bit. Shifting the stored value right one bit (dividing by 2) recovers the original data.
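
A sketch of the bit trick in Python (names invented; this mirrors, e.g., how .frq folds the Freq==1 flag into DocDelta):

def pack(value, flag):
    # Shift the value left one bit and keep the boolean in the freed low bit.
    return (value << 1) | (1 if flag else 0)

def unpack(stored):
    return stored >> 1, bool(stored & 1)    # (original value, flag)

print(pack(5, True))            # 11
print(unpack(pack(5, True)))    # (5, True)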

Obey

  • .frq DocDelta[, Freq?], DocSkip[, PayloadLength?]
  • .prx PositionDelta[, Payload?] (incomplete)

Not obey

  • .frq SkipChildLevelPointer in multi-level skip data: the last level needs no flag.
  • .tvf Positions, Offsets

Summary

  • When the “A, B?” pattern occurs often, storing one bit instead of one byte saves 8x the space, which is economical.
  • The formats that do not follow the rule rely on a global setting that applies to every Field or even the whole index, so a per-entry flag is unnecessary.

4. Skip list

To improve search performance, Lucene uses the skip list data structure in many places (a toy sketch follows at the end of this section).

  • Elements are ordered; in Lucene either in numeric or in alphabetical order.
  • Skips have an interval, configured in advance: the number of elements each skip covers.
  • A skip list has levels; the interval also determines which points are promoted to the level above [12 -> 5 -> 2][interval 3].

Definition differences

  • the interval may be defined as:

    • not containing the two bounding elements of the level above
    • containing the head but not the end [Lucene’s definition]
    • containing both bounding elements of the level above
  • the level may be defined as:

    • containing the original list, starting from 1, like 1, 2, 3
    • containing the original list, starting from 0, like 0, 1, 2
    • not containing the original list, starting from 0 [Lucene’s definition]

Jump element & Skip list

http://lucene.apache.org/core/2_9_4/fileformats.html
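
A toy Python sketch of skip-list search under the simplest definitions above (every `interval`-th element is promoted; Lucene’s own interval/level definitions differ slightly, which is why the text counts 12 -> 5 -> 2 rather than the 12 -> 4 -> 2 here):

def build_levels(elements, interval=3):
    # Each higher level keeps every `interval`-th element of the level below.
    levels = [elements]
    while len(levels[-1]) > interval:
        levels.append(levels[-1][::interval])
    return levels

def skip_search(levels, target, interval=3):
    pos = 0
    for level in reversed(levels[1:]):              # start at the sparsest level
        while pos + 1 < len(level) and level[pos + 1] <= target:
            pos += 1                                # skip forward cheaply
        pos *= interval                             # same element, one level down
    base = levels[0]
    while pos < len(base) and base[pos] < target:
        pos += 1                                    # final short linear scan
    return pos if pos < len(base) and base[pos] == target else -1

doc_ids = list(range(0, 24, 2))                     # 12 sorted doc IDs
print(skip_search(build_levels(doc_ids), 14))       # index 7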

lucene index file constructure (1)

Posted on 2017-02-01 | Edited on 2018-12-16 | In search
  • Index
    • In Lucene, an Index lives in a directory
    • All the files there constitute the Index
  • Segment
    • An Index can contain many Segments, and each Segment is independent.
      Newly added documents can be built into a new Segment, and different Segments can be merged.
    • Files with the same prefix, like “_0”, “_1”, “_2”, belong to the same Segment.
    • segments.gen and segments_X are the Segments’ metadata, storing their properties.
  • Document
    • A Document is the basic unit for building an Index. Different Documents are stored in different Segments; one Segment can contain many Documents.
    • Newly added Documents go into a newly built Segment; when Segments are merged, Documents from different Segments are merged into the same Segment.
  • Field
    • A Document may contain different types of information, like time, content, author, and so on; each can be indexed separately and stored in a different Field.
    • Different Fields can be indexed in different ways, as we will explain when analyzing Term storage.
  • Term
    • A Term is the basic unit of an Index: the string left after lexical analysis and language processing
Name Extension Description
Segments file segments_N Stores how many segments the index contains and how many documents each segment contains.
Segment metadata .si Stores metadata about an index segment.
Lock file write.lock Prevents multiple IndexWriters from writing to the same index files at once.
Compound index files .cfs, .cfe Store all the index information in compound files.
Segment field info .fnm Stores the fields this segment contains, with each field’s name and index type.
Segment document info .fdx, .fdt Store the documents this segment contains, the fields in each document, and each field’s information.
Segment term info .tim, .tip The .tim file stores statistics about each field’s terms and pointers into the .doc, .pos, and .pay files; the .tip file stores the index into the term dictionary, supporting random access.
Term frequencies and skip data .doc Stores the term frequencies for each document in this segment.
Term positions .pos Stores the term positions for each document in this segment.
Payloads and partial positions .pay Stores each document’s payloads and term offsets; part of the term position information is stored in the .pos file.
Field weighting factors .nvd, .nvm The .nvm file stores metadata for the index fields’ weighting factors; the .nvd file stores the weighting data.
Document weighting factors .dvd, .dvm The .dvm file stores metadata for the indexed documents’ weighting factors; the .dvd file stores the weighting data.
Term vector data .tvx, .tvd, .tvf The .tvd file stores this segment’s terms, term frequencies, positions, payloads, and so on; .tvx is an index file used to load specific documents into memory; .tvf stores field-level vector information.
Live documents .liv Stores index information about which documents are live.
Name Extension Brief Description
Segments File segments.gen, segments_N Stores information about segments
Lock File write.lock The Write lock prevents multiple IndexWriters from writing to the same file.
Compound File .cfs An optional “virtual” file consisting of all the other index files for systems that frequently run out of file handles.
Fields .fnm Stores information about the fields
Field Index .fdx Contains pointers to field data
Field Data .fdt The stored fields for documents
Term Infos .tis Part of the term dictionary, stores term info
Term Info Index .tii The index into the Term Infos file
Frequencies .frq Contains the list of docs which contain each term along with frequency
Positions .prx Stores position information about where a term occurs in the index
Norms .nrm Encodes length and boost factors for docs and fields
Term Vector Index .tvx Stores offset into the document data file
Term Vector Documents .tvd Contains information about each document that has term vectors
Term Vector Fields .tvf The field level info about term vectors
Deleted Documents .del Info about which documents are deleted

Lucene’s index stores not only the positive (forward) mapping but also the negative (inverted) mapping.

Positive mapping

  • From Index to Term : Index –> Segment –> Document –> Field –> Term
  • Each upper level stores its children’s metadata, like a province, a city, and a county each holding information about its children.
    • segments_N : how many Segments the Index has, and how many Documents each Segment has.
    • .fnm : how many Fields the Segment contains, with each Field’s name and indexing method.
    • .fdx , .fdt : all the Documents the Segment has, how many Fields each Document has, and what information each Field records.
    • .tvx , .tvd , .tvf : how many Documents the Segment has, how many Fields each Document has, how many words each Field has, and each word’s string, position, and so on.

Negative mapping

  • Term -> Document
    • .tis , .tii : the Term dictionary, i.e. the Segment’s words sorted alphabetically.
    • .frq : the postings table, i.e. for each word, the list of Document IDs containing it.
    • .prx : each word’s positions within the Documents of the postings table.

Primary Type

  • Byte : the most basic type, 8 bits long.
  • UInt32 : composed of 4 Bytes.
  • UInt64 : composed of 8 Bytes.
  • VInt :
    • May be composed of many Bytes.
    • Earlier bytes hold the lower-order bits.
    • For example: 51271 - [1]1000111, [1]0010000, [0]0000011
  • Chars : UTF-8 encoded bytes.
  • String : first a VInt giving the number of Chars, then the Chars themselves.
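
A small Python encoder/decoder reproducing the 51271 example (a sketch of the VInt rule, not Lucene’s code):

def vint_encode(n):
    # Low 7 bits first; the high bit of each byte means "more bytes follow".
    out = []
    while True:
        byte, n = n & 0x7F, n >> 7
        out.append(byte | (0x80 if n else 0))
        if not n:
            return bytes(out)

def vint_decode(data):
    n = 0
    for shift, byte in enumerate(data):
        n |= (byte & 0x7F) << (7 * shift)
    return n

encoded = vint_encode(51271)
print([f'{b:08b}' for b in encoded])   # ['11000111', '10010000', '00000011']
print(vint_decode(encoded))            # 51271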

Full-text retrieval fundamental (2)

Posted on 2017-01-30 | Edited on 2018-12-16 | In search

How to build Index

1. Prepare the original documents

  • file1: Researches of Chinese full-text search technologies based on word indexing is related to many fields.
  • file2: The Index Data Service provides basic full-text functions for storage and retrieval of terms and indexed summary documents.

    2. Pass the documents to the TOKENIZER

  1. split the documents into words
  2. remove punctuation symbols
  3. remove stop words

    Stop words in English: “like”, “a”, “this”…
    After the tokenizer we get these tokens:
    “Researches” “Chinese” “full” “text” “search” “technologies” “word” “indexing” “related” “many” “fields” “Index” “Data” “Service” “provides” “basic” “full” “text” “functions” “storage” “retrieval” “terms” “indexed” “summary” “documents”

3. Pass the TOKENS to the LINGUISTIC PROCESSOR

  1. lowercase everything
  2. reduce words to their root form, like “fields” to “field” (stemming)
  3. reduce words to their dictionary form, like “indexed” to “index” (lemmatization)

    The difference between “stemming” and “lemmatization”:

    • same: both reduce words to an initial form
    • difference:

      • stemming cuts the word down
      • lemmatization transforms it
    • difference in algorithm:

      • stemming deletes “s”, rewrites “ing” -> “e”, “ational” -> “ate”, “tional” -> “tion”
      • lemmatization maps “drove” -> “drive”
    • they are not mutually exclusive; they complement each other

    The results after the linguistic processor are called Terms:
    “researche” “chinese” “full” “text” “search” “technologie” “word” “index” “relate” “many” “field” “index” “data” “service” “provide” “basic” “full” “text” “function” “storage” “retrieve” “term” “index” “summary” “document”

    Because of the linguistic processor, a search for “drove” can also find documents containing “drive”.

4. Pass the TERMS to the INDEXER

  1. Build a dictionary of the Terms
Term Document ID
researche 1
chinese 1
full 1
text 1
search 1
technologie 1
word 1
index 1
relate 1
many 1
field 1
index 2
data 2
service 2
provide 2
basic 2
full 2
text 2
function 2
storage 2
retrieve 2
term 2
index 2
summary 2
document 2
  2. Sort the table alphabetically by Term
Term Document ID
basic 2
chinese 1
data 2
document 2
field 1
full 1
full 2
function 2
index 1
index 2
index 2
many 1
provide 2
relate 1
researche 1
retrieve 2
search 1
service 2
storage 2
summary 2
technologie 1
term 2
text 1
text 2
word 1
  3. Merge identical Terms into posting lists
    Term-[Document Frequency] | DocumentID-Frequency
    basic-1 2-1
    chinese-1 1-1
    data-1 2-1
    document-1 2-1
    field-1 1-1
    full-2 1-1 -> 2-1
    function-1 2-1
    index-2 2-2 -> 1-1
    many-1 1-1
    provide-1 2-1
    relate-1 1-1
    researche-1 1-1
    retrieve-1 2-1
    search-1 1-1
    service-1 2-1
    storage-1 2-1
    summary-1 2-1
    technologie-1 1-1
    term-1 2-1
    text-2 2-1 -> 1-1
    word-1 1-1
  • Document Frequency: the number of documents in which the Term appears
  • Frequency: the number of times the Term appears in that document

When searching, “drive”, “driving”, “drove”, and “driven” are all processed down to “drive”, just as during index building.
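
A compressed Python sketch of steps 2-4 on two short fragments of the example files (the stop list and tokenizer are simplified; real systems also stem and lemmatize):

from collections import defaultdict

STOP_WORDS = {'the', 'of', 'a', 'this'}

def build_index(docs):
    index = defaultdict(lambda: defaultdict(int))   # term -> {doc_id: frequency}
    for doc_id, text in docs.items():
        for token in text.lower().replace('-', ' ').split():
            if token not in STOP_WORDS:
                index[token][doc_id] += 1
    return index

docs = {1: 'Researches of Chinese full-text search technologies',
        2: 'The Index Data Service provides basic full-text functions'}
index = build_index(docs)
print(sorted(index['full'].items()))   # [(1, 1), (2, 1)] -> document frequency 2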

Full-text retrieval fundamental (3)

Posted on 2017-01-30 | Edited on 2018-12-16 | In search

How to search in Index

How do we find the results you want most, i.e. the results most relevant to the query phrase?

1. first: User inputs a query phrase

The basic query grammar includes “AND”, “OR”, “NOT”:
elephant AND tiger OR lion NOT sheep

2. second: Lexical analysis, syntax analysis, and language processing of the query phrase

  1. lexical analysis distinguishes words from keywords
  2. syntax analysis builds a grammar tree according to the syntax rules
  3. language processing works just like the processing done when building the index (see part 2)

3. third: Traverse the Index and get the results that fit the syntax tree

  1. first, find the documents containing the words (elephant, tiger, lion) in the posting lists.
  2. second, merge the document sets so the results contain elephant, tiger, or lion.
  3. third, remove the documents containing sheep, leaving documents with elephant, tiger, or lion but not sheep.
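
A sketch of the merge step in Python with an invented index; the union covers the OR part and the subtraction covers NOT (an AND query would intersect instead):

def or_not_query(index, any_terms, not_terms):
    hits = set()
    for term in any_terms:
        hits |= set(index.get(term, {}))   # union of the posting lists (OR)
    for term in not_terms:
        hits -= set(index.get(term, {}))   # subtract excluded postings (NOT)
    return sorted(hits)

index = {'elephant': {1: 2, 3: 1}, 'tiger': {2: 1}, 'lion': {3: 1}, 'sheep': {2: 1}}
print(or_not_query(index, ['elephant', 'tiger', 'lion'], ['sheep']))   # [1, 3]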

4. fourth: Sort the results by the relevance between the query phrase and each result

How do we calculate the relevance between documents and the query phrase?
Treat the query phrase as a short document and score each document’s relevance against it; the higher the score, the higher the document ranks.
How do we score the relevance between documents?
This is not an easy thing:

  • Find which factors in a document are important
  • Find the relationship between those factors

  • The process of finding the importance of a word (Term) to a document is called the term weighting process.
  • The process of judging the relationship between Terms to get document relevance uses an algorithm called the Vector Space Model.

Term weight process

This is the simple classic implementation; Lucene’s differs a little.

  • $w_{t,d}$ = the weight of term t in document d
  • $tf_{t,d}$ = the frequency of term t in document d
  • $n$ = the total number of documents
  • $df_t$ = the number of documents that contain term t
  • Term Frequency (tf): how many times the Term appears in this document; the bigger tf is, the more important the Term is.
  • Document Frequency (df): how many documents contain the Term; the bigger df is, the less important the Term is.
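
Putting these symbols together gives the classic tf-idf weight that the definitions above describe (the base of the logarithm is a free choice):

$w_{t,d} = tf_{t,d} \times \log\left(\frac{n}{df_t}\right)$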

Like a programmer: the deeper you know a technology, the better (big tf), and the fewer people who know it, the better (small df). When looking for a job, your competitive power will be great. A person’s value lies in irreplaceability.

Vector Space Model

The smaller the angle between the two vectors, the more relevant they are.
We take the cosine of the two vectors as the relevance score.
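
Written out, with the term weights of the previous section as vector components, the standard cosine score is:

$score(q,d) = \cos\theta = \frac{\vec{V}(q) \cdot \vec{V}(d)}{|\vec{V}(q)| \, |\vec{V}(d)|} = \frac{\sum_{i=1}^{n} w_{i,q} \, w_{i,d}}{\sqrt{\sum_{i=1}^{n} w_{i,q}^{2}} \sqrt{\sum_{i=1}^{n} w_{i,d}^{2}}}$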

Full-text retrieval fundamental (1)

Posted on 2017-01-29 | Edited on 2018-12-16 | In search

Introduce

  • Structured data: data with a regular structure, fixed format, and fixed length, such as a database
  • Unstructured data: data with no regular structure, variable length, and no fixed format, such as email
  • Semi-structured data: XML/HTML and the like, which can be processed in different ways as needed

Unstructured data is also called full-text data.

Full-text retrieval broadly has two phases: index creation (Indexing) and index search (Search).

  • Index creation: the process of extracting information from all the structured and unstructured data in the real world and building an index from it.
  • Index search: the process of taking a user’s query, searching the index that was built, and returning the results.

Full-text retrieval therefore poses three important questions:

  1. What exactly is stored in the index? (Index)
  2. How is the index created? (Indexing)
  3. How is the index searched? (Search)

What is stored in the index?

  • It stores a symbol table; colloquially, a mapping table.
  • The mapping from strings to files is the reverse of the mapping from files to strings, so an index that stores this kind of information is called an inverted index.
  • The series on the left is the dictionary, also called the keys.
  • The series on the right is the posting lists, also called the values.

The benefit of an index is that it is built once and used many times; if the index is rebuilt too frequently, it will instead drag down overall performance.
