2017-02-28

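Translating Google's Spanner paper, part 1/6

Paper translations

This is a translation of the Spanner paper that Google published in 2013. It is long, so I plan to work through it in about six installments.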

Spanner: Google’s Globally Distributed Database

http://dl.acm.org/citation.cfm?id=2491245

Spanner is Google’s scalable, multiversion, globally distributed, and synchronously replicated database. It is the first system to distribute data at global scale and support externally-consistent distributed transactions. This article describes how Spanner is structured, its feature set, the rationale underlying various design decisions, and a novel time API that exposes clock uncertainty. This API and its implementation are critical to supporting external consistency and a variety of powerful features: nonblocking reads in the past, lockfree snapshot transactions, and atomic schema changes, across all of Spanner.

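Spanner is Google's scalable, multiversion (note: it also keeps the past versions of a row from before each update), globally distributed, synchronously replicated database. It is the first system to distribute data at global scale while also supporting externally consistent distributed transactions. This paper describes Spanner's structure, its feature set, the rationale behind various design decisions, and a new time API that exposes clock uncertainty (note: rather than returning a single time, it returns an interval that is guaranteed to contain the current time). This time API and its implementation are critical to supporting external consistency and a range of powerful features across Spanner: nonblocking reads of past data, lock-free snapshot transactions, and atomic schema changes.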

1. INTRODUCTION

Spanner is a scalable, globally distributed database designed, built, and deployed at Google. At the highest level of abstraction, it is a database that shards data across many sets of Paxos [Lamport 1998] state machines in datacenters spread all over the world. Replication is used for global availability and geographic locality; clients automatically failover between replicas. Spanner automatically reshards data across machines as the amount of data or the number of servers changes, and it automatically migrates data across machines (even across datacenters) to balance load and in response to failures. Spanner is designed to scale up to millions of machines across hundreds of datacenters and trillions of database rows.

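1. Introduction

Spanner is a scalable, globally distributed database designed, built, and deployed at Google. At the highest level of abstraction, it is a database that shards data across many sets of Paxos state machines running in datacenters spread all over the world. Data is replicated for global availability and geographic locality, and clients automatically fail over between replicas. As the amount of data or the number of servers changes, Spanner automatically reshards data across machines, and it automatically migrates data between machines (even across datacenters) to balance load and respond to failures. Spanner is designed to scale to millions of machines, hundreds of datacenters, and trillions of database rows.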

Applications can use Spanner for high availability, even in the face of wide-area natural disasters, by replicating their data within or even across continents. Our initial customer was F1 [Shute et al. 2012], a rewrite of Google’s advertising backend. F1 uses five replicas spread across the United States. Most other applications will probably replicate their data across 3 to 5 datacenters in one geographic region, but with relatively independent failure modes. That is, most applications will choose lower latency over higher availability, as long as they can survive 1 or 2 datacenter failures.

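Even in the face of wide-area natural disasters, applications can stay available by having Spanner replicate their data within or even across continents. The first customer was F1 [Shute et al. 2012] (a distributed relational database system built to support Google's AdWords business). F1 uses five replicas spread across the United States. Most other applications will probably replicate their data across 3 to 5 datacenters in a single geographic region, chosen so that their failure modes are relatively independent. In other words, most applications will choose lower latency over higher availability, as long as they can survive the failure of 1 or 2 datacenters.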
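The "3 to 5 datacenters, survive 1 or 2 failures" figures line up with majority-quorum arithmetic for Paxos-style replication: a group keeps making progress as long as a strict majority of its replicas is reachable. Below is a minimal sketch of that arithmetic, assuming plain majority quorums. The function name is made up for illustration and does not come from the paper.

```python
def tolerated_failures(replicas: int) -> int:
    """Number of replicas that can fail while a strict majority survives."""
    if replicas < 1:
        raise ValueError("need at least one replica")
    quorum = replicas // 2 + 1  # smallest strict majority
    return replicas - quorum


for n in (3, 4, 5):
    print(n, "replicas ->", tolerated_failures(n), "failure(s) tolerated")
```

Under this assumption an even replica count buys nothing extra: 4 replicas tolerate the same single failure as 3, which is one reason odd counts such as 3 and 5 are the usual choices.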

Spanner’s main focus is managing cross-datacenter replicated data, but we have also spent a great deal of time in designing and implementing important database features on top of our distributed-systems infrastructure. Even though many projects happily use Bigtable [Chang et al. 2008], we have also consistently received complaints from users that Bigtable can be difficult to use for some kinds of applications: those that have complex, evolving schemas, or those that want strong consistency in the presence of wide-area replication. (Similar claims have been made by other authors [Stonebraker 2010b].) Many applications at Google have chosen to use Megastore [Baker et al. 2011] because of its semirelational data model and support for synchronous replication, despite its relatively poor write throughput. As a consequence, Spanner has evolved from a Bigtable-like versioned key-value store into a temporal multiversion database. Data is stored in schematized semirelational tables; data is versioned, and each version is automatically timestamped with its commit time; old versions of data are subject to configurable garbage-collection policies; and applications can read data at old timestamps. Spanner supports general-purpose transactions, and provides an SQL-based query language.

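Spanner's main focus is managing data replicated across datacenters, but we also spent a great deal of time designing and implementing important database features on top of our distributed-systems infrastructure. Many projects happily use Bigtable [Chang et al. 2008], yet we consistently received complaints from users that Bigtable can be difficult to use for some kinds of applications: those with complex schemas, those whose schemas keep evolving, and those that need strong consistency under wide-area replication. (Other authors have made similar claims [Stonebraker 2010b].) Many applications at Google chose Megastore [Baker et al. 2011] for its semirelational data model and its support for synchronous replication, despite its relatively poor write throughput. As a result, Spanner evolved from a Bigtable-like versioned key-value store into a temporal multiversion database. Data is stored in schematized semirelational tables. Data is versioned, and each version is automatically timestamped with its commit time. Old versions are subject to configurable garbage-collection policies, and applications can read data at old timestamps. Spanner supports general-purpose transactions and provides an SQL-based query language.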

As a globally distributed database, Spanner provides several interesting features. First, the replication configurations for data can be dynamically controlled at a fine grain by applications. Applications can specify constraints to control which datacenters contain which data, how far data is from its users (to control read latency), how far replicas are from each other (to control write latency), and how many replicas are maintained (to control durability, availability, and read performance). Data can also be dynamically and transparently moved between datacenters by the system to balance resource usage across datacenters. Second, Spanner has two features that are difficult to implement in a distributed database: it provides externally consistent [Gifford 1982] reads and writes, and globally consistent reads across the database at a timestamp. These features enable Spanner to support consistent backups, consistent MapReduce executions [Dean and Ghemawat 2010], and atomic schema updates, all at global scale, and even in the presence of ongoing transactions.

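As a globally distributed database, Spanner offers several interesting features. First, applications can dynamically control the replication configuration of their data at a fine grain. They can specify constraints on which datacenters hold which data, how far data is from its users (to control read latency), how far replicas are from each other (to control write latency), and how many replicas are maintained (to control durability, availability, and read performance). The system can also move data between datacenters dynamically and transparently to balance resource usage across them. Second, Spanner has two features that are difficult to implement in a distributed database: externally consistent [Gifford 1982] reads and writes, and globally consistent reads across the whole database at a timestamp. These features let Spanner support consistent backups, consistent MapReduce executions [Dean and Ghemawat 2010], and atomic schema updates, all at global scale and even while transactions are in progress.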

These features are enabled by the fact that Spanner assigns globally meaningful commit timestamps to transactions, even though transactions may be distributed. The timestamps reflect serialization order. In addition, the serialization order satisfies external consistency (or equivalently, linearizability [Herlihy and Wing 1990]): if a transaction T1 commits before another transaction T2 starts, then T1’s commit timestamp is smaller than T2’s. Spanner is the first system to provide such guarantees at global scale.

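These features are possible because Spanner assigns globally meaningful commit timestamps to transactions, even when the transactions are distributed. The timestamps reflect serialization order. In addition, the serialization order satisfies external consistency (or equivalently, linearizability [Herlihy and Wing 1990]): if a transaction T1 commits before another transaction T2 starts, then T1's commit timestamp is smaller than T2's. Spanner is the first system to provide such guarantees at global scale. (Note: relational databases generally assign an ID to each transaction to keep the processing order consistent, whereas Spanner uses accurate timestamps in place of IDs.)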

The key enabler of these properties is a new TrueTime API and its implementation. The API directly exposes clock uncertainty, and the guarantees on Spanner’s timestamps depend on the bounds that the implementation provides. If the uncertainty is large, Spanner slows down to wait out that uncertainty. Google’s cluster-management software provides an implementation of the TrueTime API. This implementation keeps uncertainty small (generally less than 10ms) by using multiple modern clock references (GPS and atomic clocks). Conservatively reporting uncertainty is necessary for correctness; keeping the bound on uncertainty small is necessary for performance.

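What makes these properties possible is the new TrueTime API and its implementation. The API directly exposes clock uncertainty (note: it returns an interval of time that is guaranteed to contain the current time), and the guarantees on Spanner's timestamps depend on the bounds that the TrueTime implementation provides. If the uncertainty is large, Spanner slows down to wait out that uncertainty. Google's cluster-management software provides an implementation of the TrueTime API. This implementation keeps the uncertainty small (generally less than 10ms) by referring to multiple modern clock references (GPS and atomic clocks). Reporting uncertainty conservatively is necessary for correctness, while keeping the bound on uncertainty small is necessary for performance.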
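To make "slows down to wait out that uncertainty" concrete, here is a toy sketch of an interval-returning clock and a commit that waits until its timestamp is definitely in the past. It is an illustration under stated assumptions, not Google's implementation: `SimulatedTrueTime`, its fixed `epsilon_s` error bound, and the `commit` helper are hypothetical names, and the clock is just the local system clock widened by a constant.

```python
import time
from dataclasses import dataclass


@dataclass(frozen=True)
class TTInterval:
    earliest: float  # true time is assumed to be >= earliest
    latest: float    # ... and <= latest


class SimulatedTrueTime:
    """Toy stand-in for TrueTime: local clock plus/minus a fixed error bound."""

    def __init__(self, epsilon_s: float = 0.005, clock=time.time):
        self.epsilon_s = epsilon_s  # assumed uncertainty bound (hypothetical)
        self._clock = clock

    def now(self) -> TTInterval:
        t = self._clock()
        return TTInterval(t - self.epsilon_s, t + self.epsilon_s)

    def after(self, t: float) -> bool:
        """True only once t has definitely passed."""
        return self.now().earliest > t


def commit(tt: SimulatedTrueTime, sleep=time.sleep) -> float:
    """Pick a commit timestamp, then wait until it is definitely in the past."""
    s = tt.now().latest
    while not tt.after(s):          # the "commit wait"
        sleep(tt.epsilon_s / 10)
    return s


tt = SimulatedTrueTime(epsilon_s=0.005)
s1 = commit(tt)   # T1 finishes committing ...
s2 = commit(tt)   # ... before T2 starts
print(s1 < s2)
```

In this sketch each commit waits roughly twice the uncertainty bound, which is why a small bound matters for performance, while a bound that understated the real clock error would break the ordering guarantee.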

Section 2 describes the structure of Spanner’s implementation, its feature set, and the engineering decisions that went into their design. Section 3 describes our new TrueTime API and sketches its implementation. Section 4 describes how Spanner uses TrueTime to implement externally-consistent distributed transactions, lock-free snapshot transactions, and atomic schema updates. Section 5 provides some benchmarks on Spanner’s performance and TrueTime behavior, and discusses the experiences of F1. Sections 6, 7, and 8 describe related and future work, and summarize our conclusions.

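Section 2 describes the structure of Spanner's implementation, its feature set, and the engineering decisions that led to the current design. Section 3 describes the new TrueTime API and sketches its implementation. Section 4 describes how Spanner uses the TrueTime API to implement externally consistent distributed transactions, lock-free snapshot transactions, and atomic schema updates. Section 5 provides benchmarks of Spanner's performance and of TrueTime's behavior, and discusses the experiences of F1. Sections 6, 7, and 8 describe related and future work and summarize the conclusions.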

blog.game-programmer.jp

2017-02-21

I went to hear about Google Cloud Spanner, the mysterious new RDB

Events

I went to the event below looking for information on Google Cloud Spanner, so here are my notes.

Event page

connpass.com

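It's finally here! "Spanner", Google's prized NewSQL RDB, by Google

Technical notes on Spanner, from Etsuji Nakai

ACID properties

ACID is the set of properties that a reliable transaction system should have.

ACID (computer science) - Wikipedia

On slide 5, "Vertical Consistency" refers to ACID guarantees within a single zone, and "Horizontal Consistency" apparently refers to ACID guarantees that span zones.

Among the ACID properties, isolation comes in several strengths, which are known as transaction isolation levels.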

gyouza-daisuki.hatenablog.com

SQL Server also has an isolation level that does not appear in the article above, the SNAPSHOT isolation level:

SET TRANSACTION ISOLATION LEVEL (Transact-SQL)

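When a SQL Server database is set to the SNAPSHOT isolation level, a transaction timestamp is added to each row's metadata.

Spanner seems to need the TrueTime API (a clock with extremely small error, built on atomic clocks and GPS) to keep its isolation level above a certain bar for a related reason: the idea appears to be that if the timestamp stored with each row is accurate on every node, SNAPSHOT-level isolation can be achieved even in a distributed environment.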

blog.engineer-memo.com

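Spanner's read-only transactions need no locks because, even while a row is being updated, the transaction only has to read the row version at or before the timestamp it specifies (the default is the current time). Incidentally, when I asked the speaker, I was told that row versions from up to one hour back are retained and can be read.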
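The idea of reading "the row version at or before the specified timestamp" can be sketched with a toy in-memory multiversion store. This is only an illustration of the technique: `MultiVersionStore`, `write`, and `read_at` are hypothetical names, timestamps are plain numbers, and nothing here reflects how Spanner stores data internally.

```python
import bisect


class MultiVersionStore:
    """Toy multiversion key-value store: every write keeps its commit timestamp."""

    def __init__(self):
        self._timestamps = {}  # key -> sorted list of commit timestamps
        self._values = {}      # key -> list of values, in the same order

    def write(self, key, value, commit_ts):
        ts_list = self._timestamps.setdefault(key, [])
        if ts_list and commit_ts <= ts_list[-1]:
            raise ValueError("commit timestamps must increase per key")
        ts_list.append(commit_ts)
        self._values.setdefault(key, []).append(value)

    def read_at(self, key, read_ts):
        """Return the newest value committed at or before read_ts, without locking."""
        ts_list = self._timestamps.get(key, [])
        i = bisect.bisect_right(ts_list, read_ts)
        return self._values[key][i - 1] if i else None


store = MultiVersionStore()
store.write("row1", "v1", commit_ts=10)
store.write("row1", "v2", commit_ts=20)
print(store.read_at("row1", 15))  # v1
```

A reader at timestamp 15 never sees or waits for the write at timestamp 20, because that write only appends a newer version. A retention window such as the one hour mentioned by the speaker would correspond to discarding versions older than that window, which this sketch leaves out.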

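Replication

Replication was left as a topic for next time, so the talk covered little beyond the fact that Spanner uses Paxos as its distributed consensus algorithm.

Well-known distributed agreement algorithms

Two-phase commit - Wikipedia

Three-phase commit - Wikipedia

Paxos algorithm - Wikipedia

An explanation of Paxos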

d.hatena.ne.jp

TODO

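There is official documentation and a paper on Spanner, so I plan to read them at some point. If I discover anything there, I may write about it again.

Official documentation

cloud.google.com

Paper: "Spanner: Google’s Globally-Distributed Database"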

https://static.googleusercontent.com/media/research.google.com/ja//archive/spanner-osdi2012.pdf

2016-12-16

Checking which technologies Super Mario Run uses, based on its license notices

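Super Mario Run is out. In the app, "Menu" → "Settings" → "About this app" contains a "License notices" section, and I'd like to use its contents to check which technologies the game uses.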

UniRx

github.com

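UniRx is a library that reimplements Reactive Extensions so that it can be used in Unity. I rely on it whenever I use Unity at work. The blog posts by the UniRx author himself are probably the most detailed source, so I'll introduce them in chronological order.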

FlatBuffers

github.com

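The Overview in the FlatBuffers documentation describes it as follows (originally quoted here via Google Translate):

FlatBuffers is an efficient cross-platform serialization library for C++, C#, C, Go, Java, JavaScript, PHP, and Python. It was originally created at Google for game development and other performance-critical applications.

FlatBuffers is a serializer specialized for deserialization performance. It can deserialize quickly because the byte array, serialized according to a defined schema, is used as-is when the object's fields are accessed, with no separate unpacking step.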
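The "use the serialized bytes as-is on access" idea can be sketched with Python's standard `struct` module. This is only an illustration of the principle and not the FlatBuffers wire format or API: the fixed field layout, the offsets, and the `PlayerView` class are assumptions made up for this example.

```python
import struct

# Hypothetical fixed layout (little-endian): int32 hp, int32 mp, float32 x, float32 y.
LAYOUT = struct.Struct("<iiff")
OFFSET_HP, OFFSET_MP, OFFSET_X, OFFSET_Y = 0, 4, 8, 12


class PlayerView:
    """Reads each field straight out of the serialized bytes when it is accessed."""

    def __init__(self, buf):
        self._buf = memoryview(buf)  # wraps the buffer: no copy, no up-front parsing

    @property
    def hp(self) -> int:
        return struct.unpack_from("<i", self._buf, OFFSET_HP)[0]

    @property
    def x(self) -> float:
        return struct.unpack_from("<f", self._buf, OFFSET_X)[0]


payload = LAYOUT.pack(120, 30, 1.5, -2.0)  # stands in for bytes received over the network
player = PlayerView(payload)
print(player.hp, player.x)
```

Constructing `PlayerView` does no work proportional to the message size, and fields that are never read are never decoded. A JSON parser, by contrast, has to walk the whole text and build objects before any field is available.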

Inspired by FlatBuffers, the author of UniRx has created and released a serializer called ZeroFormatter.

github.com

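ZeroFormatter is also a serializer specialized for deserialization performance. Like FlatBuffers, it uses the serialized byte array as-is when the object is accessed, but it lets you use C# class definitions directly as the schema definition. So, for example, when the client of a game written in C# (Unity, UWP, and so on) communicates frequently, it is easier to use than FlatBuffers.

Incidentally, the README of the ZeroFormatter repository has performance comparison results for serializers available in C#.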

https://github.com/neuecc/ZeroFormatter#performance

OpenSSL

https://www.openssl.org/

OpenSSL is a library for the TLS and SSL protocols. My guess is that it is included because Facebook SDK for Unity depends on it.

iOS Native Code Samples

bitbucket.org

iOS Native Code Samples is sample code published by Unity Technologies, the developer of Unity.

MiniJson

Unity3D: MiniJSON Decodes and encodes simple JSON strings. Not intended for use with massive JSON strings, probably < 32k preferred. Handy for parsing JSON from inside Unity3d. · GitHub

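MiniJson is a JSON serializer. This is pure speculation on my part, but perhaps FlatBuffers is used where serialization and deserialization happen frequently, such as real-time communication in versus play, while JSON is used where they happen only occasionally, such as HTTPS communication with an API.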

Facebook SDK for Unity

developers.facebook.com

Facebook SDK for Unity is a library that lets apps developed in Unity use features such as Facebook login, sharing to Facebook, and sharing data with Facebook friends (rankings, for example).

adjust SDK for Unity

docs.adjust.com

adjust SDK for Unity is a library for using adjust, a service that collects and visualizes KPIs such as revenue, session counts, and install counts, and measures the effectiveness of advertising.

Firebase-Unity

firebase.google.com

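Firebase-Unity is a library for using Firebase, an mBaaS that provides as a service the technologies needed to develop and operate mobile apps: push notifications, data synchronization between client and server or between clients, analytics, testing, crash reporting, and so on.

Summary

The use of FlatBuffers suggests a strong focus on performance. The use of Firebase also suggests that the server side can be run on managed services. Wherever there are games there is the Observer pattern, and once you have used UniRx in place of a homegrown Observer implementation it is so convenient that there is no going back, so I suspect UniRx is becoming a library that many Unity developers cannot do without. That's all from the field.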