A Short, Friendly GUID/UUID in .NET

I recently committed to returning a unique ID that was 18 characters or less as part of the response from a new version of an existing web method I support. This value would be stored and used for subsequent calls to other web methods and would potentially be manually entered into a website query page.

Easy, you say. Just use an IDENTITY (or SEQUENCE in Oracle) column (which is already in place as a surrogate key, BTW). The problem with IDENTITY columns is that you have to wait for the call to the database to complete before you know what the value will be. In this scenario, we are using MSMQ to enhance the performance and reliability of the operation — this allows us to return from the web method call without waiting on the DB (which *usually* is quite fast but slows down for short periods of time infrequently). This web method gets called very frequently and the calls can be bursty, so the queuing maintains a very quick response time under all circumstances.

I had planned all along to use a GUID/UNIQUEIDENTIFIER column in the underlying table with the GUID generated during the web method call and written to the write queue with the rest of the data. Unfortunately, I confused two numbers in my head when I made this commitment. GUIDs are 16 *bytes* long but their default display format is actually 32 hex digits with 4 dashes thrown in for good measure.

I didn’t want to go back to the client and ask them to increase the size of this field, especially since this was going to be a human-queryable field on a web site. I considered base 36 encoding the GUID, but that would still take 25 digits. Next I considered Base64 encoding a GUID, which is actually quite clever, but it still came up “short” (or should I say long) by 4 digits, compressing the GUID down to 22 digits. Still, that’s a nice improvement over 36.
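As a quick sketch of the Base64 idea (my illustration, not code from that post): the 16 GUID bytes encode to 24 Base64 characters, and the trailing “==” padding is constant, so dropping it gets you down to 22.

```csharp
using System;

class Base64GuidDemo
{
	static void Main()
	{
		// 16 bytes -> 24 Base64 chars, always ending in "==";
		// dropping the constant padding leaves 22 characters.
		Guid g = Guid.NewGuid();
		string shortId = Convert.ToBase64String(g.ToByteArray()).TrimEnd('=');
		Console.WriteLine(shortId.Length); // 22
	}
}
```

The catch for a human-facing field is that the result can contain ‘+’ and ‘/’ and is case-sensitive, which is part of why it never felt right for a value people would type into a query page.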

Rick Strahl (a very prolific, well-written .net blogger) contemplated using GetHashCode().ToString("x") on a variety of targets, but I couldn’t get comfortable with the statistical validity of that approach.

I considered using a time-based value (such as DateTime.Ticks), but I could not get past the bursty usage problem with that approach. I also felt that I must be re-inventing the wheel. The GUID was the “best practice” and I couldn’t help but feel I should use that approach, or at least a similar one.

The “G” in GUID stands for global. I really didn’t need “global” uniqueness. In fact, I just needed to guarantee statistical uniqueness for all of these specific types of calls made by a particular client. I could use the “client id” as part of the lookup to further reduce the “uniqueness” required. I then did some research on UUIDs (GUID is just MS’s brand name for UUID) and saw that later versions of the algorithm use random number generators. I decided that if 128 bits gets me “global” uniqueness, I could get by with a few fewer bits for my scenario. So now the question was how to encode those bits.

I actually used base 36 over a decade ago in an Xbase application to avoid having to increase the width of a field but there are a couple of problems with it. First, because 36 is not a power of 2, you cannot easily (as in bit shifting) convert bits to digits. Second, 0/O (zero, and oh) and 1/I (one and eye) can easily be confused, particularly in some fonts.

Removing those four digits leads to my choice, base 32 with 2-9 and the capital letters minus O and I as the digits. You can easily convert to base 32 because it’s a power of 2 and you get a very friendly set of digits with good bit density.

Now the final piece of the puzzle was how to generate “good” random numbers. Good for me meant a very, very low probability of collisions, even if I was generating an ID on the same machine every 10 milliseconds or so. As I mentioned above, usage of the service can be bursty. Some clients are pseudo-batching their work and hit the service hard for a relatively short period of time.

While doing some research on .net random number generation, I learned that it is much better to instantiate the generator once than to repeatedly instantiate it for each number. This sounds counter-intuitive at first, but makes sense when you think about the time-based default seeding these generators use. So a singleton was in order.
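A two-line illustration of the seeding problem (mine, not from any documentation): same seed means the exact same sequence, and time-based default seeding makes identical seeds likely for instances created in quick succession.

```csharp
using System;

class SeedDemo
{
	static void Main()
	{
		// Two generators with the same seed produce identical "random"
		// sequences; the time-based default constructor can hand out
		// the same seed to instances created back-to-back.
		var r1 = new Random(12345);
		var r2 = new Random(12345);
		Console.WriteLine(r1.Next() == r2.Next()); // True
	}
}
```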

Finally, I had to choose between the venerable Random class and RNGCryptoServiceProvider. Although I think the singleton approach with a single instance of Random would have worked, I felt more comfortable with the more robust seeding used by RNGCryptoServiceProvider, particularly since we are running on clusters and our web service calls are sessionless (no affinity).

There’s one final wrinkle. In my scenario, I have a unique (most of the time) 32 bit integer value available. I wanted to incorporate that into my scheme to further reduce the odds of a collision. I do this by taking a byte array as an input that is “mixed” with the random bytes to produce the ID.
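To make that concrete, here’s a hypothetical call against the generator class below; the client value and digit count are illustrative, not from my production code:

```csharp
// Fold a (mostly unique) 32-bit value into the ID; the generator fills
// the remaining bytes with crypto-strength random data.
int clientValue = 123456789; // hypothetical per-client value
byte[] basis = BitConverter.GetBytes(clientValue);
string id = UniqueIdGenerator.GetInstance().GetBase32UniqueId(basis, 18);
// id is now 18 friendly base-32 digits
```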

Without further ado, here’s the code for the class itself and some unit tests:

using System;
using System.Security.Cryptography;

public class UniqueIdGenerator
{
	private static readonly UniqueIdGenerator _instance = new UniqueIdGenerator();
	private static char[] _charMap = { // 0, 1, O, and I omitted intentionally giving 32 (2^5) symbols
		'2', '3', '4', '5', '6', '7', '8', '9', 
		'A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'J', 'K', 'L', 'M', 'N', 'P', 'Q', 'R', 'S', 'T', 'U', 'V', 'W', 'X', 'Y', 'Z'
	};

	public static UniqueIdGenerator GetInstance()
	{
		return _instance;
	}

	private RNGCryptoServiceProvider _provider = new RNGCryptoServiceProvider();

	private UniqueIdGenerator()
	{
	}

	public void GetNext(byte[] bytes)
	{
		_provider.GetBytes(bytes);
	}

	public string GetBase32UniqueId(int numDigits)
	{
		return GetBase32UniqueId(new byte[0], numDigits);
	}
    
	public string GetBase32UniqueId(byte[] basis, int numDigits)
	{
		int byteCount = 16;
		var randBytes = new byte[byteCount - basis.Length];
		GetNext(randBytes);
		var bytes = new byte[byteCount];
		Array.Copy(basis, 0, bytes, byteCount - basis.Length, basis.Length);
		Array.Copy(randBytes, 0, bytes, 0, randBytes.Length);

		ulong lo = (((ulong)BitConverter.ToUInt32(bytes, 8)) << 32) | BitConverter.ToUInt32(bytes, 12); // BitConverter.ToUInt64(bytes, 8);
		ulong hi = (((ulong)BitConverter.ToUInt32(bytes, 0)) << 32) | BitConverter.ToUInt32(bytes, 4);  // BitConverter.ToUInt64(bytes, 0);
		ulong mask = 0x1F;

	var chars = new char[26];
		int charIdx = 25;

		ulong work = lo;
		for (int i = 0; i < 26; i++)
		{
			if (i == 12)
			{
				work = ((hi & 0x01) << 4) | work; // combine the low bit of hi with the remaining top 4 bits of lo
			}
			else if (i == 13)
			{
				work = hi >> 1;
			}
			byte digit = (byte)(work & mask);
			chars[charIdx] = _charMap[digit];
			charIdx--;
			work = work >> 5;
		}

		var ret = new string(chars, 26 - numDigits, numDigits);
		return ret;
	}
}

using System;
using System.Collections.Generic;
using NUnit.Framework;

[TestFixture]
public class UniqueIdGeneratorTest
{
[Test]
public void GetInstanceTest()
{
var instance = UniqueIdGenerator.GetInstance();
Assert.AreSame(instance, UniqueIdGenerator.GetInstance());
}

[Test]
public void GetNextTest()
{
var b1 = new byte[16];
for (int i = 0; i < 16; i++) b1[i] = 0;
UniqueIdGenerator.GetInstance().GetNext(b1);
Assert.That(Array.Exists(b1, b => b != 0)); // This could be false every billion years or so
}

[Test]
public void GetBase32UniqueIdTest()
{
var b1 = new byte[16];
for (int i = 0; i < 16; i++) b1[i] = 0;
string id = UniqueIdGenerator.GetInstance().GetBase32UniqueId(b1, 26);
Assert.AreEqual(26, id.Length);
Assert.AreEqual(new string('2', 26), id);

b1 = new byte[] { 0xFF, 0xFF, 0xFF, 0xFF };
id = UniqueIdGenerator.GetInstance().GetBase32UniqueId(b1, 26);
System.Diagnostics.Trace.WriteLine(id);
Assert.AreEqual(26, id.Length);
Assert.AreEqual("ZZZZZZ", id.Substring(20, 6));

id = UniqueIdGenerator.GetInstance().GetBase32UniqueId(b1, 6);
System.Diagnostics.Trace.WriteLine(id);
Assert.AreEqual(6, id.Length);
Assert.AreEqual("ZZZZZZ", id);

id = UniqueIdGenerator.GetInstance().GetBase32UniqueId(18);
System.Diagnostics.Trace.WriteLine(id);
Assert.AreEqual(18, id.Length);

var id2 = UniqueIdGenerator.GetInstance().GetBase32UniqueId(18);
System.Diagnostics.Trace.WriteLine(id2);
Assert.AreEqual(18, id2.Length);
Assert.AreNotEqual(id, id2);
}

[Test, Ignore]
public void GetBase32UniqueIdDupeTest()
{
var alreadySeen = new Dictionary<string, string>(1000000);
System.Diagnostics.Trace.WriteLine("Allocated");
for (int i = 0; i < 100000000; i++)
{
var id = UniqueIdGenerator.GetInstance().GetBase32UniqueId(12);
Assert.That(!alreadySeen.ContainsKey(id));
alreadySeen.Add(id, id);
}
}
}


Expressions versus Delegates in LINQ to SQL Performance

Recently, I was working on a class that stored a delegate and later used that delegate to retrieve a LINQ to SQL object from either the database or an EntitySet, depending on the situation. Unfortunately, the performance for the first case was terrible; I believe it was returning the entire table and then applying the delegate to each row in memory.

Here’s the relevant code (original version):

public abstract Func<D, bool> GetDbKeyQueryFunction(E entity);
//...
D dbEntity;
if ( isRoot ) {
	dbEntity = context.GetTable<D>().SingleOrDefault(GetDbKeyQueryFunction(entity));
} else {
	dbEntity = DbParentSet.Invoke(dbParent).SingleOrDefault(GetDbKeyQueryFunction(entity));
}

Here’s a typical implementation of GetDbKeyQueryFunction:

public override Func<SomeTable, bool> GetDbKeyQueryFunction(SomeEntity entity) {
	return dbEntity => dbEntity.Id == entity.Id;
}

Now before I created this abstract class, I had specific code that used a (seemingly) identical lambda expression to retrieve a row from the database quite quickly:

int id = 11;
dataContext.SomeTable.SingleOrDefault(dbEntity => dbEntity.Id == id));

When I hovered over SingleOrDefault in the IDE, I noticed that there were two overloads, one that took Expression<Func> and another that took Func. On the surface, these don’t seem much different. In fact, the reason I ended up using Func in my abstract class was that Expression is abstract and I couldn’t just put new Expression() around my lambda expression.

In practice, the difference is huge, as I mentioned above regarding the performance. It appears that LINQ to SQL can optimize expressions but cannot optimize delegates. The tricky part is all I had to do was change:

public abstract Func<D, bool> GetDbKeyQueryFunction(E entity);

to:

public abstract Expression<Func<D, bool>> GetDbKeyQueryFunction(E entity);

Surprisingly (to me, anyway), you don’t have to do anything in the implementing code other than modify the method signature to match. The body of the method remains unchanged, but like magic, the compiler is now creating an Expression instead of a Delegate for us.
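The distinction is easier to see side by side (my sketch; SomeTable here is just a stand-in for the LINQ to SQL entity class from earlier):

```csharp
using System;
using System.Linq.Expressions;

class SomeTable { public int Id { get; set; } }

class ExpressionDemo
{
	static void Main()
	{
		// Identical lambda text; only the declared type differs.
		Func<SomeTable, bool> asDelegate = t => t.Id == 11;               // compiled IL, opaque
		Expression<Func<SomeTable, bool>> asExpression = t => t.Id == 11; // inspectable tree

		// A LINQ provider can walk the tree and translate it to a WHERE clause...
		Console.WriteLine(asExpression.Body); // (t.Id == 11)

		// ...and you can always go from tree to delegate, but not back.
		Func<SomeTable, bool> compiled = asExpression.Compile();
		Console.WriteLine(compiled(new SomeTable { Id = 11 })); // True
	}
}
```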

Performance for the first case (querying the database) was back up to par. However, there was another minor hurdle to overcome.

public abstract Expression<Func<D, bool>> GetDbKeyQueryFunction(E entity);
//...
D dbEntity;
if ( isRoot ) {
	dbEntity = context.GetTable<D>().SingleOrDefault(GetDbKeyQueryFunction(entity));
} else {
	dbEntity = DbParentSet.Invoke(dbParent).SingleOrDefault(GetDbKeyQueryFunction(entity));
}

In the code above, DbParentSet.Invoke(dbParent) returns an EntitySet instance. Interestingly enough, you cannot directly apply an Expression to an EntitySet, as I found documented here. However, you can compile an Expression, and in my scenario, the type of the value returned by Compile just happened to be Func (it all seems so obvious now, hehe). So here’s the final version of the code that works well for both scenarios:

public abstract Expression<Func<D, bool>> GetDbKeyQueryFunction(E entity);
//...
D dbEntity;
if ( isRoot ) {
	dbEntity = context.GetTable<D>().SingleOrDefault(GetDbKeyQueryFunction(entity));
} else {
	dbEntity = DbParentSet.Invoke(dbParent).SingleOrDefault(GetDbKeyQueryFunction(entity).Compile());
}

So remember, always favor Expressions over Delegates when using LINQ. If you have to have a Delegate, you can always use Compile to turn your Expression into a Delegate as needed. But you can’t turn a Delegate into an Expression (that I’m aware of).


Easily Create DataTables For Unit Tests

Ideally, you have a data access layer that only returns domain objects or data transfer objects. The key is real, strongly-typed objects. One of the many reasons for this is to make unit testing easier. Unfortunately, sometimes you are stuck working with code that uses datatables. I’ve recently written a function that takes most of the pain out of creating datatables for unit testing. This function and the code that calls it take advantage of several new c# features: lambda expressions, object initializers, list initializers, and auto-implemented properties. It also uses generics and reflection.

The first thing that stinks about creating datatables from scratch is that you have to manually define the columns, which is tedious. The second thing that stinks is that you can only add rows to the datatable using an object array, which means you have to count commas to keep track of which column you are populating. Or you have to do something awkward like obj[datatable.Columns["FieldName"].Ordinal] = "somevalue". To get around this, the first step is to define the structure of our rows by creating a simple class, e.g.:

public class Person
{
	public string LastName { get; set; }
	public string FirstName { get; set; }
	public DateTime DateOfBirth { get; set; }
	public decimal Salary { get; set; }
}

Now we can call the method with a very convenient syntax that capitalizes on object and list initializers:

public DataTable GetPeople()
{
	return ListToTable(new List<Person> {
		new Person {
			LastName = "Opincar",
			FirstName = "John",
			DateOfBirth = new DateTime(1901, 1, 1),
			Salary = 250000.00M
		},
		new Person {
			LastName = "Lincoln",
			FirstName = "Abe",
			DateOfBirth = new DateTime(1801, 1, 15),
			Salary = 1000.00M
		}
	});
}

If you’ve ever manually populated datatables you can really appreciate what a huge improvement this is.

Finally, we discuss the method itself. Using reflection, we can leverage the type information stored in Person to create our datatable columns in a generic method that takes a List as input and returns a datatable:

public DataTable ListToTable<T>(List<T> rows)
{
	var dt = new DataTable();
	var props = typeof(T).GetProperties();
	Array.ForEach(props, p => dt.Columns.Add(p.Name, p.PropertyType));
	foreach (var r in rows)
	{
		object[] vals = new object[props.Length];
		for (int idx = 0; idx < vals.Length; idx++)
		{
			vals[idx] = props[idx].GetValue(r, null);
		}
		dt.Rows.Add(vals);
	}
	return dt;
}

I hope you find this method as useful as I have.

PGP Single Pass Sign and Encrypt with Bouncy Castle

Bouncy Castle is a great open source resource. However, the off-the-shelf PGP functionality is severely lacking in real-world usability. Most of what you need is easy enough to code up yourself (and I would love to contribute what I’ve done if I could). One thing that you really need that doesn’t come built-in, and is actually quite hard to do right, is PGP Single Pass Sign and Encrypt. Here’s a class in the style of csharp\crypto\test\src\openpgp\examples\KeyBasedFileProcessor.cs that does exactly that.

using System;
using System.IO;
using Org.BouncyCastle.Bcpg;
using Org.BouncyCastle.Bcpg.OpenPgp;
using Org.BouncyCastle.Security;

namespace PgpCrypto
{
	public class PgpProcessor
	{
		public void SignAndEncryptFile(string actualFileName, string embeddedFileName,
			Stream keyIn, long keyId, Stream outputStream,
			char[] password, bool armor, bool withIntegrityCheck, PgpPublicKey encKey)
		{
			const int BUFFER_SIZE = 1 << 16; // should always be power of 2

			if (armor)
				outputStream = new ArmoredOutputStream(outputStream);

			// Init encrypted data generator
			PgpEncryptedDataGenerator encryptedDataGenerator =
				new PgpEncryptedDataGenerator(SymmetricKeyAlgorithmTag.Cast5, withIntegrityCheck, new SecureRandom());
			encryptedDataGenerator.AddMethod(encKey);
			Stream encryptedOut = encryptedDataGenerator.Open(outputStream, new byte[BUFFER_SIZE]);

			// Init compression
			PgpCompressedDataGenerator compressedDataGenerator = new PgpCompressedDataGenerator(CompressionAlgorithmTag.Zip);
			Stream compressedOut = compressedDataGenerator.Open(encryptedOut);

			// Init signature
			PgpSecretKeyRingBundle pgpSecBundle = new PgpSecretKeyRingBundle(PgpUtilities.GetDecoderStream(keyIn));
			PgpSecretKey pgpSecKey = pgpSecBundle.GetSecretKey(keyId);
			if (pgpSecKey == null)
				throw new ArgumentException(keyId.ToString("X") + " could not be found in specified key ring bundle.", "keyId");
			PgpPrivateKey pgpPrivKey = pgpSecKey.ExtractPrivateKey(password);
			PgpSignatureGenerator signatureGenerator = new PgpSignatureGenerator(pgpSecKey.PublicKey.Algorithm, HashAlgorithmTag.Sha1);
			signatureGenerator.InitSign(PgpSignature.BinaryDocument, pgpPrivKey);
			foreach (string userId in pgpSecKey.PublicKey.GetUserIds())
			{
				PgpSignatureSubpacketGenerator spGen = new PgpSignatureSubpacketGenerator();
				spGen.SetSignerUserId(false, userId);
				signatureGenerator.SetHashedSubpackets(spGen.Generate());
				// Just the first one!
				break;
			}
			signatureGenerator.GenerateOnePassVersion(false).Encode(compressedOut);

			// Create the Literal Data generator output stream
			PgpLiteralDataGenerator literalDataGenerator = new PgpLiteralDataGenerator();
			FileInfo embeddedFile = new FileInfo(embeddedFileName);
			FileInfo actualFile = new FileInfo(actualFileName);
			Stream literalOut = literalDataGenerator.Open(compressedOut, PgpLiteralData.Binary,
				embeddedFile.Name, actualFile.LastWriteTime, new byte[BUFFER_SIZE]);

			// Open the input file
			FileStream inputStream = actualFile.OpenRead();

			byte[] buf = new byte[BUFFER_SIZE];
			int len;
			while ((len = inputStream.Read(buf, 0, buf.Length)) > 0)
			{
				literalOut.Write(buf, 0, len);
				signatureGenerator.Update(buf, 0, len);
			}

			literalOut.Close();
			literalDataGenerator.Close();
			signatureGenerator.Generate().Encode(compressedOut);
			compressedOut.Close();
			compressedDataGenerator.Close();
			encryptedOut.Close();
			encryptedDataGenerator.Close();
			inputStream.Close();

			if (armor)
				outputStream.Close();
		}
	}
}

TortoiseSVN: Tweaking the Context Menu; Global Ignore

If you use TortoiseSVN a lot like I do, then you’ve probably often wished you could put more frequently used items in the first context menu that comes up when you right-click on a file in Windows Explorer. By default, only update, commit, and checkout are there. You can choose which items appear in the first menu and which appear in the TortoiseSVN sub-menu by selecting Settings, Look and Feel. There you can check and uncheck items to determine whether or not they are on the sub-menu. I got rid of Checkout and included add, delete, rename, and diff.

Another time saver available in settings is Global Ignore Pattern. Items matching this pattern will automatically be ignored when you commit. I currently have mine set to:

*/[Bb]in [Bb]in */obj obj */[Rr]elease *.user *.suo *.resharper */_ReSharper.* _ReSharper.* *.bak *.dll *.pdb.

Compare Files Method in C# Unit Testing

I’ve been working on a class that wraps the low-level PGP crypto capabilities of Bouncy Castle and presents a higher-level, directly usable interface similar to a command-line utility. One obvious test is to start with a file, sign and encrypt it, then decrypt it, and compare the result with the original. I’ve gotten so spoiled by having so many rudimentary things like this already available in the .net framework that I was surprised when I didn’t find a Compare method on the File class or anything like it somewhere else. If there is something already built-in, please post a comment. If there isn’t, hopefully this small method will save a few people a few minutes.

private bool FileCompare(string srcFileName, string dstFileName)
{
	const int BUFFER_SIZE = 1 << 16;
	FileInfo src = new FileInfo(srcFileName);
	FileInfo dst = new FileInfo(dstFileName);
	if (src.Length != dst.Length)
		return false;
	using (Stream srcStream = src.OpenRead(), dstStream = dst.OpenRead())
	{
		byte[] srcBuf = new byte[BUFFER_SIZE];
		byte[] dstBuf = new byte[BUFFER_SIZE];
		int len;
		while ((len = srcStream.Read(srcBuf, 0, srcBuf.Length)) > 0)
		{
			// Read may return fewer bytes than requested, so loop until
			// dstBuf holds as many bytes as srcBuf before comparing
			int dstLen = 0;
			while (dstLen < len)
				dstLen += dstStream.Read(dstBuf, dstLen, len - dstLen);
			for (int i = 0; i < len; i++)
				if (srcBuf[i] != dstBuf[i])
					return false;
		}
		return true;
	}
}

How Many Programming Languages Have You Used?

I’ve been hearing all the buzz for the last several years over dynamic languages like Ruby and Python. I always like to try the latest and greatest from time to time, and sometimes even move myself professionally in that direction. I guess because I started out working primarily with interpreted, weakly typed languages and I’m now horribly spoiled by intellisense, I’m having trouble even taking a look at either Ruby or Python (yes, I know they are *technically* not weakly typed, but the “dynamic” part equates in my feeble mind). I was perusing a list of programming languages trying to find an interesting alternative to Ruby or Python to play with, and thought it would be interesting to list the ones that I’ve actually written programs in and what level of experience I have with them.

BASIC — Hobby (Like many others, lowly BASIC got me hooked on programming at an early age)
dBaseIII — Production
Clipper — Production, Commercial
TurboPascal — Academic
DOS Batch — Production
8086 Assembly — Production, Commercial
Modula-2 — Academic
PDP-11 Assembly — Academic
Lisp — Academic
Prolog — Academic
Smalltalk — Academic
Ada — Academic
Actor — Hobby (Am I forgetting the name or how it’s spelled? Can’t find a link.)
C — Academic, Production, Commercial
C++ — Academic, Production
Protel (Nortel proprietary — I actually worked on the DMS 250) — Production
Rexx — Production
Visual Basic — Production
T-SQL — Production
Java — Production
PL-SQL — Production
C# — Production
PHP — Production

One language I left out was the pet project of a professor of mine in college. It was a Prolog-like language that he had us implement as our major assignment in compilers class. I actually won a copy of his book (aren’t you jealous :p) for having the best compiler in the class. Unfortunately, I threw the book out a few years ago and I can’t recall the name of the professor or the language for the life of me.

So which languages have you used and which ones are you interested in learning?