BurningBright


Match characters

Posted on 2017-02-05 | Edited on 2020-09-17 | In regex

Match a character

Hexadecimal character

[a-fA-F0-9]

  • Regex options: None
  • Regex flavors: .NET, Java, JavaScript, PCRE, Perl, Python, Ruby

Nonhexadecimal character

[^a-fA-F0-9]

  • Regex options: None
  • Regex flavors: .NET, Java, JavaScript, PCRE, Perl, Python, Ruby

The notation using square brackets is called a character class.

  • ‹\›, ‹^›, ‹-›, and ‹]› have special functions inside square brackets

  • JavaScript treats ‹[]› as an empty character class that always fails to match.

  • In all the regex flavors discussed in this book, a negated character class matches line break characters, unless you add them to the negated character class. Make sure that you don’t accidentally allow your regex to span across lines.
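
A quick Python sketch (Python being one of the listed flavors) illustrating both classes and the line-break caveat; the sample strings are invented:

import re

hex_char = re.compile(r'[a-fA-F0-9]')
non_hex = re.compile(r'[^a-fA-F0-9]')

print(hex_char.findall('0xBeef'))      # ['0', 'B', 'e', 'e', 'f'] ('x' is not a hex digit)
print(bool(non_hex.search('cafe')))    # False: every character is a hex digit
print(bool(non_hex.search('ca\nfe')))  # True: the negated class matches the newline too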


About \w

  • Java 4-6, JavaScript, PCRE, Ruby: \w -> [a-zA-Z0-9_]
  • Java 7: with the Pattern.UNICODE_CHARACTER_CLASS flag set, \w matches Unicode word characters; ‹(?U)› turns this on inside the regex.
  • Python 2.x: with the UNICODE or U flag set, \w matches Unicode word characters.
  • Python 3.x: \w matches Unicode word characters by default; the ASCII or A flag makes it ASCII-only.
  • Perl >= 5.14: /u (Unicode) adds all Unicode scripts, /l (locale) makes \w depend on the locale, and /d (the default, when none of /adlu is given) follows the same Unicode-script rules as Perl < 5.14.
  • Perl < 5.14: \w automatically includes Unicode scripts if the subject string or the regex is encoded as UTF-8, or if the regex includes a code point above 255 such as ‹\x{100}› or a Unicode property such as ‹\p{L}›. Otherwise \w defaults to pure ASCII.
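
A short Python 3 sketch of the flag behavior described above (the sample string is arbitrary):

import re

s = 'naïve_42'
print(re.findall(r'\w', s))            # Unicode by default in Python 3: 'ï' is included
print(re.findall(r'(?a)\w', s))        # inline ASCII flag: 'ï' is no longer a word character
print(re.findall(r'\w', s, re.ASCII))  # same effect via the ASCII/A flag argument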

About \d

  • \d follows the same rules as \w in all these flavors.
  • In .NET, digits from other scripts are always included.
  • In Python it depends on the UNICODE and ASCII flags, and on whether you’re using Python 2.x or 3.x. In Perl 5.14, it depends on the /adlu flags.
  • In earlier versions of Perl, it depends on the encoding of the subject and regex, and on whether the regex has any Unicode tokens.

About \s

  • \s matches any whitespace character. This includes spaces, tabs, and line breaks.
  • \S matches any character not matched by \s
  • In .NET and JavaScript, \s also matches any character defined as whitespace by the Unicode standard.
  • In Java, Perl, and Python, \s follows the same rules as \w and \d.
  • Notice that JavaScript uses Unicode for ‹\s› but ASCII for \d and \w.

Flavor-Specific Features

[a-zA-Z0-9-[g-zG-Z]]
  • uses character class subtraction to match a single hexadecimal character.
  • Regex options: None
  • Regex flavors: .NET 2.0 or later

[\w&&[a-fA-F0-9\s]]
  • uses character class intersection to match a hexadecimal digit.
  • Regex options: None
  • Regex flavors: Java
[a-zA-Z0-9&&[^g-zG-Z]]
  • uses character class subtraction to match a single hexadecimal character in a roundabout way.
  • Regex options: None
  • Regex flavors: Java
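
Python has neither class intersection nor subtraction, but a lookahead can emulate both; a hedged sketch of that common workaround:

import re

# Intersection [\w&&[a-fA-F0-9\s]] emulated: the lookahead restricts the position
# to the second class, then \w consumes the character.
hex1 = re.compile(r'(?=[a-fA-F0-9\s])\w')

# Subtraction [a-zA-Z0-9&&[^g-zG-Z]] emulated with a negative lookahead.
hex2 = re.compile(r'(?![g-zG-Z])[a-zA-Z0-9]')

print(hex1.findall('x = 0xFF'))   # ['0', 'F', 'F']
print(hex2.findall('x = 0xFF'))   # ['0', 'F', 'F']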

Match any character

Any character except line breaks

‹.›

  • Regex options: None (the “dot matches line breaks” option must not be set)
  • Regex flavors: .NET, Java, JavaScript, PCRE, Perl, Python, Ruby

Any character including line breaks

‹.›

  • Regex options: Dot matches line breaks
  • Regex flavors: .NET, Java, PCRE, Perl, Python, Ruby

‹[\s\S]›

  • Regex options: None
  • Regex flavors: .NET, Java, JavaScript, PCRE, Perl, Python, Ruby

Inline mode modifier

‹(?s).›

  • Regex options: None
  • Regex flavors: .NET, Java, XRegExp, PCRE, Perl, Python

‹(?m).›

  • Regex options: None
  • Regex flavors: Ruby
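
In Python the same options look like this (sample text invented):

import re

text = 'first\nsecond'
print(len(re.findall(r'.', text)))            # 11: the newline is not matched
print(len(re.findall(r'(?s).', text)))        # 12: inline mode modifier includes it
print(len(re.findall(r'[\s\S]', text)))       # 12: flavor-neutral workaround
print(len(re.findall(r'.', text, re.DOTALL))) # 12: the external option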

ASCII

Posted on 2017-02-05 | Edited on 2020-09-17 | In tool

ASCII

0 1 2 3 4 5 6 7 8 9 A B C D E F
0 NUL SOH STX ETX EOT ENQ ACK BEL BS HT LF VT FF CR SO SI
1 DLE DC1 DC2 DC3 DC4 NAK SYN ETB CAN EM SUB ESC FS GS RS US
2 (space) ! " # $ % & ' ( ) * + , - . /
3 0 1 2 3 4 5 6 7 8 9 : ; < = > ?
4 @ A B C D E F G H I J K L M N O
5 P Q R S T U V W X Y Z [ \ ] ^ _
6 ` a b c d e f g h i j k l m n o
7 p q r s t u v w x y z { | } ~ DEL (delete)

Control characters

  • NUL (null)
  • SOH (start of heading)
  • STX (start of text)
  • ETX (end of text)
  • EOT (end of transmission)
  • ENQ (enquiry)
  • ACK (acknowledge)
  • BEL (bell)
  • BS (backspace)
  • HT (horizontal tab)
  • LF (NL line feed, new line)
  • VT (vertical tab)
  • FF (NP form feed, new page)
  • CR (carriage return)
  • SO (shift out)
  • SI (shift in)
  • DLE (data link escape)
  • DC1 (device control 1)
  • DC2 (device control 2)
  • DC3 (device control 3)
  • DC4 (device control 4)
  • NAK (negative acknowledge)
  • SYN (synchronous idle)
  • ETB (end of trans. block)
  • CAN (cancel)
  • EM (end of medium)
  • SUB (substitute)
  • ESC (escape)
  • FS (file separator)
  • GS (group separator)
  • RS (record separator)
  • US (unit separator)

Match nonprintable characters

Posted on 2017-02-05 | Edited on 2020-09-17 | In regex

Problem

Match these characters

  • bell
  • escape
  • form feed
  • line feed
  • carriage return
  • horizontal tab
  • vertical tab
    hexadecimal ASCII codes:
    07, 1B, 0C, 0A, 0D, 09, 0B

Solution

\a\e\f\n\r\t\v
  • Regex options: None
  • Regex flavors: .NET, Java, PCRE, Perl, Ruby
\x07\x1B\f\n\r\t\v
  • Regex options: None
  • Regex flavors: .NET, Java, JavaScript, Python, Ruby
\a\e\f\n\r\t\x0B
  • Regex options: None
  • Regex flavors: .NET, Java, PCRE, Perl, Ruby
Representation Meaning Hex Regex flavors
‹\a› bell 0x07 .NET, Java, PCRE, Perl, Python, Ruby
‹\e› escape 0x1B .NET, Java, PCRE, Perl, Ruby
‹\f› form feed 0x0C .NET, Java, JavaScript, PCRE, Perl, Python, Ruby
‹\n› line feed (newline) 0x0A .NET, Java, JavaScript, PCRE, Perl, Python, Ruby
‹\r› carriage return 0x0D .NET, Java, JavaScript, PCRE, Perl, Python, Ruby
‹\t› horizontal tab 0x09 .NET, Java, JavaScript, PCRE, Perl, Python, Ruby
‹\v› vertical tab 0x0B .NET, Java, JavaScript, Python, Ruby, Perl 5.10, PCRE 7.2

Variations

\cG\x1B\cL\cJ\cM\cI\cK
  • Regex options: None
  • Regex flavors: .NET, Java, JavaScript, PCRE, Perl, Ruby 1.9

The 7-bit character set

\x07\x1B\x0C\x0A\x0D\x09\x0B

  • Regex options: None
  • Regex flavors: .NET, Java, JavaScript, PCRE, Perl, Python, Ruby
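
A Python check of the table above; Python’s re has no ‹\e›, so escape is written ‹\x1B› here (the sample string is invented):

import re

nonprintable = re.compile(r'[\a\x1B\f\n\r\t\v]')

sample = 'col1\tcol2\r\nbeep\x07'
print([hex(ord(c)) for c in nonprintable.findall(sample)])
# ['0x9', '0xd', '0xa', '0x7']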

Match literal text

Posted on 2017-02-04 | Edited on 2020-09-17 | In regex

Problem

Match this literal text

The punctuation characters in the ASCII table are: !"#$%&'()*+,-./:;<=>?@[\]^_`{|}~

Solution

The●punctuation●characters●in●the●ASCII●table●are:●↵
!"#\$%&'\(\)\*\+,-\./:;<=>\?@\[\\]\^_`\{\|}~
  • Regex options: None
  • Regex flavors: .NET, Java, JavaScript, PCRE, Perl, Python, Ruby
$()*+.?[\^{|
  • these twelve characters are regex metacharacters; escape them to match them literally
  • ‹]› and ‹}› are not metacharacters and need no escaping

Variations

The●punctuation●characters●in●the●ASCII●table●are:●↵
\Q!"#$%&'()*+,-./:;<=>?@[\]^_`{|}~\E
  • Regex options: None
  • Regex flavors: Java 6, PCRE, Perl

Use ‹\Q…\E› to suppress the meaning of all metacharacters, including the backslash.
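
Python has no ‹\Q…\E›, but re.escape plays the same role; a minimal sketch:

import re

literal = 'The punctuation characters in the ASCII table are: !"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~'
pattern = re.compile(re.escape(literal))   # escapes every metacharacter for us
print(bool(pattern.search(literal)))       # True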


Case-insensitive matching

Outside the regex
ascii

  • Regex options: Case insensitive
  • Regex flavors: .NET, Java, JavaScript, PCRE, Perl, Python, Ruby

Inside the regex
(?i)ascii

  • Regex options: None
  • Regex flavors: .NET, Java, XRegExp, PCRE, Perl, Python, Ruby
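
Both spellings in Python:

import re

print(bool(re.search(r'(?i)ascii', 'US-ASCII')))             # inline mode modifier
print(bool(re.search(r'ascii', 'US-ASCII', re.IGNORECASE)))  # external option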

github style

Posted on 2017-02-03 | Edited on 2020-06-17 | In tool


Top to bottom, left to right

  • XXX’s GitHub repository : the “official” page
  • Your GitHub repository : your “personal” page
  • Working directory : the “battlefield”
  • Local repository : local project storage
  • Index : cache space
  1. "fork" : on the web, “official” -> “personal”
  2. "Pull request" : on the web, hand changes back to “official”
  3. push : local storage to remote storage
  4. fetch : remote storage to local storage
  5. clone : copy a remote git project
  6. pull XXX-github : get “official” updates into local storage and merge them into the “battlefield”
  7. merge : merge local storage into the “battlefield”
  8. pull my-github : get remote storage updates into local storage
  9. checkout : restore a file or hash version to the “battlefield” directly from local storage
  10. checkout : restore a file to the “battlefield”, overwriting it from the cache
  11. commit -a : commit to local storage and update the remote directly (push)
  12. commit : commit to local storage
  13. add : put additions/deletions/modifications into the cache
  14. reset : reset a file or hash version in the cache from local storage

lucene index file constructure (2)

Posted on 2017-02-02 | Edited on 2018-12-16 | In search

1. Prefix & Suffix

Lucene’s inverted index stores the Term Dictionary. All Terms are sorted in dictionary order, but the dictionary contains every word in the documents; some words are long, and the space they occupy would be astonishing. So in Lucene, when a word shares a prefix with the previous word, it stores the length of the shared prefix followed by only its suffix.

Naive storage:

[VInt = 4] [t][e][r][m]
[VInt = 10][t][e][r][m][a][g][a][n][c][y]
[VInt = 9][t][e][r][m][a][g][a][n][t]
[VInt = 8][t][e][r][m][i][n][a][l]

Compressed storage:

[VInt = 4] [t][e][r][m]
[VInt = 4 (offset)][VInt = 6][a][g][a][n][c][y]
[VInt = 8 (offset)][VInt = 1][t]
[VInt = 4(offset)][VInt = 4][i][n][a][l]

This reduces the occupied space; in dictionary order especially, adjacent words have a high probability of sharing a prefix.
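
A minimal Python sketch of this shared-prefix encoding (function name invented; Lucene’s real on-disk format is more involved):

def compress(terms):
    # Store each term as (shared-prefix length, suffix), as in the example above.
    out, prev = [], ''
    for term in terms:
        n = 0
        while n < min(len(prev), len(term)) and prev[n] == term[n]:
            n += 1
        out.append((n, term[n:]))
        prev = term
    return out

print(compress(['term', 'termagancy', 'termagant', 'terminal']))
# [(0, 'term'), (4, 'agancy'), (8, 't'), (4, 'inal')]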

2. Delta

Lucene’s inverted index needs to store many Integer numbers, such as IDs, positions, and frequencies. To reduce the space Integers occupy, an increasing sequence stores only each value’s difference from the previous one.

16386 16387 16388 16389
[(1) 000, 0010][(1) 000, 0000][(0) 000, 0001]
[(1) 000, 0011][(1) 000, 0000][(0) 000, 0001]
[(1) 000, 0100][(1) 000, 0000][(0) 000, 0001]
[(1) 000, 0101][(1) 000, 0000][(0) 000, 0001]
16386 16387 16388 16389
[(1) 000, 0010][(1) 000, 0000][(0) 000, 0001]
[(0) 000, 0001]
[(0) 000, 0001]
[(0) 000, 0001]
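
The same idea as a tiny Python sketch (helper invented for illustration):

def deltas(values):
    # Keep the first value, then only each increase over the previous value.
    return values[:1] + [b - a for a, b in zip(values, values[1:])]

print(deltas([16386, 16387, 16388, 16389]))  # [16386, 1, 1, 1]
# 16386 needs a 3-byte VInt; each delta of 1 fits in a single byte.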

3. Flag (A, B?)

The same situation recurs in the Lucene index structure: after a block there may or may not be another block. The naive approach adds a whole byte to flag this boolean, but a single bit is enough: shift the original data left one bit and keep the flag in the freed low bit. Shifting the stored value right one bit (dividing by 2) recovers the original data.
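
A sketch of the bit trick in Python (names invented; this mirrors, e.g., how .frq folds the Freq==1 flag into DocDelta):

def pack(value, flag):
    # Shift the value left one bit and keep the boolean in the freed low bit.
    return (value << 1) | (1 if flag else 0)

def unpack(stored):
    return stored >> 1, bool(stored & 1)    # (original value, flag)

print(pack(5, True))            # 11
print(unpack(pack(5, True)))    # (5, True)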

Obey

  • .frq DocDelta[, Freq?], DocSkip[, PayloadLength?]
  • .prx PositionDelta[, Payload?] (incomplete)

Not obey

  • .frq SkipChildLevelPointer in multi-level skip data: the last level needs no flag.
  • .tvf Positions, Offsets

Summary

  • When the “A, B?” pattern occurs often, storing one bit instead of one byte saves 8x the space, which is economical.
  • The formats that do not follow the rule rely on a global setting that applies to every Field or even the whole index, so a per-entry flag is unnecessary.

4. Skip list

To improve search performance, Lucene uses the skip list data structure in many places (a toy sketch follows at the end of this section).

  • Elements are ordered; in Lucene either in numeric or in alphabetical order.
  • Skips have an interval, configured in advance: the number of elements each skip covers.
  • A skip list has levels; the interval also determines which points are promoted to the level above [12 -> 5 -> 2][interval 3].

Definition differences

  • the interval may be defined as:

    • not containing the two bounding elements of the level above
    • containing the head but not the end [Lucene’s definition]
    • containing both bounding elements of the level above
  • the level may be defined as:

    • containing the original list, starting from 1, like 1, 2, 3
    • containing the original list, starting from 0, like 0, 1, 2
    • not containing the original list, starting from 0 [Lucene’s definition]

Jump element & Skip list

http://lucene.apache.org/core/2_9_4/fileformats.html
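
A toy Python sketch of skip-list search under the simplest definitions above (every `interval`-th element is promoted; Lucene’s own interval/level definitions differ slightly, which is why the text counts 12 -> 5 -> 2 rather than the 12 -> 4 -> 2 here):

def build_levels(elements, interval=3):
    # Each higher level keeps every `interval`-th element of the level below.
    levels = [elements]
    while len(levels[-1]) > interval:
        levels.append(levels[-1][::interval])
    return levels

def skip_search(levels, target, interval=3):
    pos = 0
    for level in reversed(levels[1:]):              # start at the sparsest level
        while pos + 1 < len(level) and level[pos + 1] <= target:
            pos += 1                                # skip forward cheaply
        pos *= interval                             # same element, one level down
    base = levels[0]
    while pos < len(base) and base[pos] < target:
        pos += 1                                    # final short linear scan
    return pos if pos < len(base) and base[pos] == target else -1

doc_ids = list(range(0, 24, 2))                     # 12 sorted doc IDs
print(skip_search(build_levels(doc_ids), 14))       # index 7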

lucene index file constructure (1)

Posted on 2017-02-01 | Edited on 2018-12-16 | In search
  • Index
    • In Lucene, an Index lives in a directory
    • All the files there constitute the Index
  • Segment
    • An Index can contain many Segments, and each Segment is independent.
      Newly added documents can be built into a new Segment, and different Segments can be merged.
    • Files with the same prefix, like “_0”, “_1”, “_2”, belong to the same Segment.
    • segments.gen and segments_X are the Segments’ metadata, storing their properties.
  • Document
    • A Document is the basic unit for building an Index. Different Documents are stored in different Segments; one Segment can contain many Documents.
    • Newly added Documents go into a newly built Segment; when Segments are merged, Documents from different Segments are merged into the same Segment.
  • Field
    • A Document may contain different types of information, like time, content, author, and so on; each can be indexed separately and stored in a different Field.
    • Different Fields can be indexed in different ways, as we will explain when analyzing Term storage.
  • Term
    • A Term is the basic unit of an Index: the string left after lexical analysis and language processing
Name Extension Description
Segments file segments_N Stores how many segments the index contains and how many documents each segment contains.
Segment metadata .si Stores metadata about an index segment.
Lock file write.lock Prevents multiple IndexWriters from writing to the same index files at once.
Compound index files .cfs, .cfe Store all the index information in compound files.
Segment field info .fnm Stores the fields this segment contains, with each field’s name and index type.
Segment document info .fdx, .fdt Store the documents this segment contains, the fields in each document, and each field’s information.
Segment term info .tim, .tip The .tim file stores statistics about each field’s terms and pointers into the .doc, .pos, and .pay files; the .tip file stores the index into the term dictionary, supporting random access.
Term frequencies and skip data .doc Stores the term frequencies for each document in this segment.
Term positions .pos Stores the term positions for each document in this segment.
Payloads and partial positions .pay Stores each document’s payloads and term offsets; part of the term position information is stored in the .pos file.
Field weighting factors .nvd, .nvm The .nvm file stores metadata for the index fields’ weighting factors; the .nvd file stores the weighting data.
Document weighting factors .dvd, .dvm The .dvm file stores metadata for the indexed documents’ weighting factors; the .dvd file stores the weighting data.
Term vector data .tvx, .tvd, .tvf The .tvd file stores this segment’s terms, term frequencies, positions, payloads, and so on; .tvx is an index file used to load specific documents into memory; .tvf stores field-level vector information.
Live documents .liv Stores index information about which documents are live.
Name Extension Brief Description
Segments File segments.gen, segments_N Stores information about segments
Lock File write.lock The Write lock prevents multiple IndexWriters from writing to the same file.
Compound File .cfs An optional “virtual” file consisting of all the other index files for systems that frequently run out of file handles.
Fields .fnm Stores information about the fields
Field Index .fdx Contains pointers to field data
Field Data .fdt The stored fields for documents
Term Infos .tis Part of the term dictionary, stores term info
Term Info Index .tii The index into the Term Infos file
Frequencies .frq Contains the list of docs which contain each term along with frequency
Positions .prx Stores position information about where a term occurs in the index
Norms .nrm Encodes length and boost factors for docs and fields
Term Vector Index .tvx Stores offset into the document data file
Term Vector Documents .tvd Contains information about each document that has term vectors
Term Vector Fields .tvf The field level info about term vectors
Deleted Documents .del Info about which documents are deleted

Lucene’s index stores not only the positive (forward) mapping but also the negative (inverted) mapping.

Positive mapping

  • From Index to Term : Index –> Segment –> Document –> Field –> Term
  • Each upper level stores its children’s metadata, like a province, a city, and a county each holding information about its children.
    • segments_N : how many Segments the Index has, and how many Documents each Segment has.
    • .fnm : how many Fields the Segment contains, with each Field’s name and indexing method.
    • .fdx , .fdt : all the Documents the Segment has, how many Fields each Document has, and what information each Field records.
    • .tvx , .tvd , .tvf : how many Documents the Segment has, how many Fields each Document has, how many words each Field has, and each word’s string, position, and so on.

Negative mapping

  • Term -> Document
    • .tis , .tii : the Term dictionary, i.e. the Segment’s words sorted alphabetically.
    • .frq : the postings table, i.e. for each word, the list of Document IDs containing it.
    • .prx : each word’s positions within the Documents of the postings table.

Primary Type

  • Byte : the most basic type, 8 bits long.
  • UInt32 : composed of 4 Bytes.
  • UInt64 : composed of 8 Bytes.
  • VInt :
    • May be composed of many Bytes.
    • Earlier bytes hold the lower-order bits.
    • For example: 51271 - [1]1000111, [1]0010000, [0]0000011
  • Chars : UTF-8 encoded bytes.
  • String : first a VInt giving the number of Chars, then the Chars themselves.
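
A small Python encoder/decoder reproducing the 51271 example (a sketch of the VInt rule, not Lucene’s code):

def vint_encode(n):
    # Low 7 bits first; the high bit of each byte means "more bytes follow".
    out = []
    while True:
        byte, n = n & 0x7F, n >> 7
        out.append(byte | (0x80 if n else 0))
        if not n:
            return bytes(out)

def vint_decode(data):
    n = 0
    for shift, byte in enumerate(data):
        n |= (byte & 0x7F) << (7 * shift)
    return n

encoded = vint_encode(51271)
print([f'{b:08b}' for b in encoded])   # ['11000111', '10010000', '00000011']
print(vint_decode(encoded))            # 51271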

Full-text retrieval fundamental (2)

Posted on 2017-01-30 | Edited on 2018-12-16 | In search

How to build Index

1. Prepare the original documents

  • file1: Researches of Chinese full-text search technologies based on word indexing is related to many fields.
  • file2: The Index Data Service provides basic full-text functions for storage and retrieval of terms and indexed summary documents.

    2. Pass the documents to the TOKENIZER

  1. split the documents into words
  2. remove punctuation symbols
  3. remove stop words

    Stop words in English: “like”, “a”, “this”…
    After the tokenizer we get these tokens:
    “Researches” “Chinese” “full” “text” “search” “technologies” “word” “indexing” “related” “many” “fields” “Index” “Data” “Service” “provides” “basic” “full” “text” “functions” “storage” “retrieval” “terms” “indexed” “summary” “documents”

3. Pass the TOKENS to the LINGUISTIC PROCESSOR

  1. lowercase everything
  2. reduce words to their root form, like “fields” to “field” (stemming)
  3. reduce words to their dictionary form, like “indexed” to “index” (lemmatization)

    The difference between “stemming” and “lemmatization”:

    • same: both reduce words to an initial form
    • difference:

      • stemming cuts the word down
      • lemmatization transforms it
    • difference in algorithm:

      • stemming deletes “s”, rewrites “ing” -> “e”, “ational” -> “ate”, “tional” -> “tion”
      • lemmatization maps “drove” -> “drive”
    • they are not mutually exclusive; they complement each other

    The results after the linguistic processor are called Terms:
    “researche” “chinese” “full” “text” “search” “technologie” “word” “index” “relate” “many” “field” “index” “data” “service” “provide” “basic” “full” “text” “function” “storage” “retrieve” “term” “index” “summary” “document”

    Because of the linguistic processor, a search for “drove” can also find documents containing “drive”.

4. Pass the TERMS to the INDEXER

  1. Build a dictionary of the Terms
Term Document ID
researche 1
chinese 1
full 1
text 1
search 1
technologie 1
word 1
index 1
relate 1
many 1
field 1
index 2
data 2
service 2
provide 2
basic 2
full 2
text 2
function 2
storage 2
retrieve 2
term 2
index 2
summary 2
document 2
  2. Sort the table alphabetically by Term
Term Document ID
basic 2
chinese 1
data 2
document 2
field 1
full 1
full 2
function 2
index 1
index 2
index 2
many 1
provide 2
relate 1
researche 1
retrieve 2
search 1
service 2
storage 2
summary 2
technologie 1
term 2
text 1
text 2
word 1
  3. Merge identical Terms into posting lists
    Term-[Document Frequency] | DocumentID-Frequency
    basic-1 2-1
    chinese-1 1-1
    data-1 2-1
    document-1 2-1
    field-1 1-1
    full-2 1-1 -> 2-1
    function-1 2-1
    index-2 2-2 -> 1-1
    many-1 1-1
    provide-1 2-1
    relate-1 1-1
    researche-1 1-1
    retrieve-1 2-1
    search-1 1-1
    service-1 2-1
    storage-1 2-1
    summary-1 2-1
    technologie-1 1-1
    term-1 2-1
    text-2 2-1 -> 1-1
    word-1 1-1
  • Document Frequency: the number of documents in which the Term appears
  • Frequency: the number of times the Term appears in that document

When searching, “drive”, “driving”, “drove”, and “driven” are all processed down to “drive”, just as during index building.
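
A compressed Python sketch of steps 2-4 on two short fragments of the example files (the stop list and tokenizer are simplified; real systems also stem and lemmatize):

from collections import defaultdict

STOP_WORDS = {'the', 'of', 'a', 'this'}

def build_index(docs):
    index = defaultdict(lambda: defaultdict(int))   # term -> {doc_id: frequency}
    for doc_id, text in docs.items():
        for token in text.lower().replace('-', ' ').split():
            if token not in STOP_WORDS:
                index[token][doc_id] += 1
    return index

docs = {1: 'Researches of Chinese full-text search technologies',
        2: 'The Index Data Service provides basic full-text functions'}
index = build_index(docs)
print(sorted(index['full'].items()))   # [(1, 1), (2, 1)] -> document frequency 2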

Full-text retrieval fundamental (3)

Posted on 2017-01-30 | Edited on 2018-12-16 | In search

How to search in Index

How do we find the results you want most, i.e. the results most relevant to the query phrase?

1. first: User inputs a query phrase

The basic query grammar includes “AND”, “OR”, “NOT”:
elephant AND tiger OR lion NOT sheep

2. second: Lexical analysis, syntax analysis, and language processing of the query phrase

  1. lexical analysis distinguishes words from keywords
  2. syntax analysis builds a grammar tree according to the syntax rules
  3. language processing works just like the processing done when building the index (see part 2)

3. third: Traverse the Index and get the results that fit the syntax tree

  1. first, find the documents containing the words (elephant, tiger, lion) in the posting lists.
  2. second, merge the document sets so the results contain elephant, tiger, or lion.
  3. third, remove the documents containing sheep, leaving documents with elephant, tiger, or lion but not sheep.
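
A sketch of the merge step in Python with an invented index; the union covers the OR part and the subtraction covers NOT (an AND query would intersect instead):

def or_not_query(index, any_terms, not_terms):
    hits = set()
    for term in any_terms:
        hits |= set(index.get(term, {}))   # union of the posting lists (OR)
    for term in not_terms:
        hits -= set(index.get(term, {}))   # subtract excluded postings (NOT)
    return sorted(hits)

index = {'elephant': {1: 2, 3: 1}, 'tiger': {2: 1}, 'lion': {3: 1}, 'sheep': {2: 1}}
print(or_not_query(index, ['elephant', 'tiger', 'lion'], ['sheep']))   # [1, 3]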

4. fourth: Sort the results by the relevance between the query phrase and each result

How do we calculate the relevance between documents and the query phrase?
Treat the query phrase as a short document and score each document’s relevance against it; the higher the score, the higher the document ranks.
How do we score the relevance between documents?
This is not an easy thing:

  • Find which factors in a document are important
  • Find the relationship between those factors

  • The process of finding the importance of a word (Term) to a document is called the term weighting process.
  • The process of judging the relationship between Terms to get document relevance uses an algorithm called the Vector Space Model.

Term weight process

This is the simple classic implementation; Lucene’s differs a little.

  • $w_{t,d}$ = the weight of term t in document d
  • $tf_{t,d}$ = the frequency of term t in document d
  • $n$ = the total number of documents
  • $df_t$ = the number of documents that contain term t
  • Term Frequency (tf): how many times the Term appears in this document; the bigger tf is, the more important the Term is.
  • Document Frequency (df): how many documents contain the Term; the bigger df is, the less important the Term is.
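
Putting these symbols together gives the classic tf-idf weight that the definitions above describe (the base of the logarithm is a free choice):

$w_{t,d} = tf_{t,d} \times \log\left(\frac{n}{df_t}\right)$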

Like a programmer: the deeper you know a technology, the better (big tf), and the fewer people who know it, the better (small df). When looking for a job, your competitive power will be great. A person’s value lies in irreplaceability.

Vector Space Model

The smaller the angle between the two vectors, the more relevant they are.
We take the cosine of the two vectors as the relevance score.
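
Written out, with the term weights of the previous section as vector components, the standard cosine score is:

$score(q,d) = \cos\theta = \frac{\vec{V}(q) \cdot \vec{V}(d)}{|\vec{V}(q)| \, |\vec{V}(d)|} = \frac{\sum_{i=1}^{n} w_{i,q} \, w_{i,d}}{\sqrt{\sum_{i=1}^{n} w_{i,q}^{2}} \sqrt{\sum_{i=1}^{n} w_{i,d}^{2}}}$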

Full-text retrieval fundamental (1)

Posted on 2017-01-29 | Edited on 2018-12-16 | In search

Introduce

  • Structured data: data with a regular structure, fixed format, and fixed length, such as a database
  • Unstructured data: data with no regular structure, variable length, and no fixed format, such as email
  • Semi-structured data: XML/HTML and the like, which can be processed in different ways as needed

Unstructured data is also called full-text data.

Full-text retrieval broadly has two phases: index creation (Indexing) and index search (Search).

  • Index creation: the process of extracting information from all the structured and unstructured data in the real world and building an index from it.
  • Index search: the process of taking a user’s query, searching the index that was built, and returning the results.

Full-text retrieval therefore poses three important questions:

  1. What exactly is stored in the index? (Index)
  2. How is the index created? (Indexing)
  3. How is the index searched? (Search)

What is stored in the index?

  • It stores a symbol table; colloquially, a mapping table.
  • The mapping from strings to files is the reverse of the mapping from files to strings, so an index that stores this kind of information is called an inverted index.
  • The series on the left is the dictionary, also called the keys.
  • The series on the right is the posting lists, also called the values.

The benefit of an index is that it is built once and used many times; if the index is rebuilt too frequently, it will instead drag down overall performance.
