Wednesday, January 13, 2010

Encoding and File in .NET

  1. Contents can be stored in a file in various formats: ASCII, Unicode, BASE64, Binary or others.
  2. Unicode content allows an optional byte order mark (BOM) at the beginning of the file. The BOM indicates the byte order and signals the Unicode encoding (UTF-8/UTF-16/UTF-32) of the content, which a file consumer can use to detect the format.
  3. File contents can be represented in .NET program as text (string, char[]) or binary (byte[]). Different representations can be obtained using different IO classes. 
  4. The StreamReader/StreamWriter classes deal with text content, so they need a way to transform text to and from a stream. This is accomplished by the encoding object associated with the StreamReader/StreamWriter class. For a StreamWriter, the encoding object's constructor (for example, UTF8Encoding(bool)) lets you set whether the BOM is written to the file. For a StreamReader, the detectEncodingFromByteOrderMarks constructor parameter sets whether the BOM is auto-detected; if a Unicode format is detected, it is used instead of the associated encoding.
  5. A byte array can be obtained from the file, for example with File.ReadAllBytes. If you know the format used to store the file, the GetString method of an encoding class converts the bytes into text. Conversely, the GetBytes method of an encoding class converts text into an encoded byte array.
  6. There are many ways to deal with files, but encoding almost always plays a role, either behind the scenes or explicitly in your code.
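
The points above can be tried out with a small round trip. This is only a sketch: the class name EncodingDemo, the method RoundTrip and the temporary scratch file are made up for illustration. It writes text as UTF-8 with a BOM, inspects the raw bytes, then reads the text back through a StreamReader that was deliberately given ASCII as its associated encoding.

```csharp
using System;
using System.IO;
using System.Text;

public static class EncodingDemo
{
    // Hypothetical helper: writes text as UTF-8 with a BOM, then reads it back.
    public static string RoundTrip(string text, out bool hasBom)
    {
        string path = Path.GetTempFileName(); // scratch file, deleted below
        try
        {
            // UTF8Encoding(true) asks the encoding to emit the BOM (EF BB BF).
            using (var writer = new StreamWriter(path, false, new UTF8Encoding(true)))
            {
                writer.Write(text);
            }

            // Binary view of the file: the first three bytes should be the BOM.
            byte[] raw = File.ReadAllBytes(path);
            hasBom = raw.Length >= 3 && raw[0] == 0xEF && raw[1] == 0xBB && raw[2] == 0xBF;

            // Text view: ASCII is the associated encoding, but with
            // detectEncodingFromByteOrderMarks = true the detected BOM wins.
            using (var reader = new StreamReader(path, Encoding.ASCII, true))
            {
                return reader.ReadToEnd();
            }
        }
        finally
        {
            File.Delete(path);
        }
    }

    public static void Main()
    {
        bool hasBom;
        string text = RoundTrip("caf\u00e9", out hasBom);
        Console.WriteLine(hasBom);
        Console.WriteLine(text == "caf\u00e9");

        // GetBytes/GetString convert between text and encoded bytes directly.
        byte[] bytes = Encoding.UTF8.GetBytes("caf\u00e9");
        Console.WriteLine(bytes.Length); // 5: three ASCII bytes plus two for the accented e
        Console.WriteLine(Encoding.UTF8.GetString(bytes) == "caf\u00e9");
    }
}
```

If the third StreamReader argument is changed to false, the reader keeps the ASCII encoding it was given, so the text read back should no longer match what was written.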

Wednesday, January 6, 2010

.NET Framework Regular Expression Object Model

The Regular Expression Engine

The regular expression engine in the .NET Framework is represented by the Regex class. The class provides instance methods and static methods that serve the same purposes. The Regex class is discussed in more detail later.

The regular expression object model includes several important classes. The Match class inherits from the Group class, which in turn inherits from the Capture class. Therefore both Match and Group are captures, but in different senses. Match.Value (Value is inherited from Capture) reflects the whole matched string, whereas Group.Value reflects the string captured by the group. Match.Groups[0] is a special group: it always exists and represents the entire matched string, so it equals Match.Value. Other groups exist if grouping is defined in the pattern; Match.Groups[n].Value for n > 0 contains only the string that matches the pattern defined for that group. A group has more than one capture only when a quantifier is applied to it; in that case Group.Value is the last capture.
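
The relationship between Match, Group and Capture can be inspected with a short program. This is a sketch only: the class name CaptureDemo and the helper GroupCaptures are invented for illustration, and the input and pattern are the ones used in the samples later in this post.

```csharp
using System;
using System.Text.RegularExpressions;

public static class CaptureDemo
{
    // Hypothetical helper: returns every capture recorded for one group of the first match.
    public static string[] GroupCaptures(string input, string pattern, int groupIndex)
    {
        Match match = Regex.Match(input, pattern);
        CaptureCollection captures = match.Groups[groupIndex].Captures;
        string[] values = new string[captures.Count];
        for (int i = 0; i < captures.Count; i++)
        {
            values[i] = captures[i].Value;
        }
        return values;
    }

    public static void Main()
    {
        Match match = Regex.Match("abcabc abcabce", @"(abc){2}e?");

        Console.WriteLine(match.Value);           // abcabc (the whole match)
        Console.WriteLine(match.Groups[0].Value); // abcabc (same as match.Value)
        Console.WriteLine(match.Groups[1].Value); // abc (the last capture of group 1)

        // The {2} quantifier makes group 1 capture twice within one match.
        foreach (string value in GroupCaptures("abcabc abcabce", @"(abc){2}e?", 1))
        {
            Console.WriteLine(value);
        }
    }
}
```

Without the {2} quantifier, group 1 would record exactly one capture per match.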

You can call the methods of the Regex class to perform the following operations:
  • Determine whether a string matches a regular expression pattern.
  • Extract a single match or the first match.
  • Extract all matches.
  • Replace a matched substring.
  • Split a single string into an array of strings.
Determine whether a string matches a regular expression pattern
Sample: Regex.IsMatch("abcabc abcabce", @"(abc){2}e?")
This returns true because there is at least one match found in the input string.

Extract a single match or the first match
Sample: Match match = Regex.Match("abcabc abcabce", @"(abc){2}e?")
This returns the first match found in the input string. In the sample, match.Value returns "abcabc". In this case, match.Groups[0].Value = "abcabc" and match.Groups[1].Value = "abc".

Extract all matches
Sample: MatchCollection matches = Regex.Matches("abcabc abcabce", @"(abc){2}e?")
This returns all matches found in the input string. In the sample, matches[0].Value returns "abcabc" and matches[1].Value returns "abcabce".


Replace a matched substring
Sample: Regex.Replace("abcabc abcabce", @"(abc){2}e?", @"$1xyz")
This returns the input string with each matched substring replaced by the replacement string, which in this case expands to "abcxyz". The result is "abcxyz abcxyz".
You can also use Match.Result("replacement pattern") to expand a replacement pattern for a single match. It returns the expanded replacement string rather than a modified copy of the input: Regex.Match("abc", "(?<ab>ab)c").Result("${ab}de") returns "abde".
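
Both replacement styles can be checked side by side. This is a minimal sketch using the same inputs as the samples above; the class name ReplaceDemo is made up for illustration.

```csharp
using System;
using System.Text.RegularExpressions;

public static class ReplaceDemo
{
    public static void Main()
    {
        // $1 refers to numbered group 1, which holds "abc" in both matches.
        string replaced = Regex.Replace("abcabc abcabce", @"(abc){2}e?", "$1xyz");
        Console.WriteLine(replaced); // abcxyz abcxyz

        // ${ab} refers to the named group "ab"; Result expands the pattern
        // for this one match and leaves the input string untouched.
        Match match = Regex.Match("abc", "(?<ab>ab)c");
        Console.WriteLine(match.Result("${ab}de")); // abde
    }
}
```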

Split a single string into an array of strings
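Sample: Regex.Matches("abcabc abcabce", @"[\w\s]")
This returns a MatchCollection of 14 single-character matches, which essentially parses the string into a character array. Note that this sample uses Regex.Matches rather than Regex.Split: Regex.Split breaks the input at each match of the pattern and returns the pieces in between as a string array.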
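
For comparison, here is a sketch that calls Regex.Split on the same input, with whitespace chosen as the delimiter pattern for illustration, next to the Regex.Matches call from the sample. The class name SplitDemo is hypothetical.

```csharp
using System;
using System.Text.RegularExpressions;

public static class SplitDemo
{
    public static void Main()
    {
        // Split breaks the input wherever the pattern matches (here, at whitespace).
        string[] words = Regex.Split("abcabc abcabce", @"\s");
        Console.WriteLine(words.Length); // 2
        Console.WriteLine(words[0]);     // abcabc
        Console.WriteLine(words[1]);     // abcabce

        // Matches returns one Match per character, because the character
        // class [\w\s] matches exactly one character at a time.
        MatchCollection chars = Regex.Matches("abcabc abcabce", @"[\w\s]");
        Console.WriteLine(chars.Count);  // 14
    }
}
```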