Mar 10 2012
 

On OS X, the loader dyld does have a search path, defined in the DYLD_FRAMEWORK_PATH and DYLD_LIBRARY_PATH variables. However, these are empty on OS X by default, so they rarely matter.

Sometimes we want to install a third-party library to a location that is not a system default (neither /usr/local/lib nor /usr/lib), for personal reasons or because we lack root privileges. Suppose the library is libdummy, its install_name is just libdummy.1.dylib or similar, and you are building a program that links against it. After compilation, you check the shared libraries your program uses:

otool -L program

The output lists libdummy.1.dylib, and nothing but libdummy.1.dylib on that line. Yes, the linker stored the install_name there, not the location of the library.
Then you run the program, and dyld complains that it cannot find libdummy.1.dylib. That is a story many people run into on OS X.

On Linux, we could modify /etc/ld.so.conf to make the loader search additional directories. On OS X things are different. We also do not want to set DYLD_ variables every time we launch the program, nor add them to .zshrc, .bashrc, …

Since install_name is what actually matters, we can simply make use of it.

On Darwin, gcc has several platform-specific options, such as -dynamic, -arch, -bundle, and -install_name; the last one plays the important role here.

gcc -o libdummy.dylib -install_name ${PREFIX}/lib/libdummy.dylib ...

This sets the install_name of libdummy.dylib to a well-defined path. The next time you link your program against libdummy, the linker stores that path. Use otool -D to print the install_name of a given library.

Besides absolute paths, we could use other techniques as well:
@executable_path, @loader_path, @rpath. This article describes these very well.

For already built libraries and programs, there is no need to rebuild them. On OS X, there is a very useful tool: install_name_tool.

  • change install_name for a library:
    install_name_tool -id "new_install_name" libdummy.dylib
  • change linked install_name in a program:
    install_name_tool -change "old_install_name" "new_install_name" program

One more thing. If you use cmake to generate the Makefile for your project, you can solve the install_name issue like this:

SET(CMAKE_INSTALL_NAME_DIR @executable_path)

Replace @executable_path with your own choice.

Feb 18 2012
 

HBase provides a web UI for simple inspection, through which a user can even issue compact and split requests. Now I want to monitor network usage on the server where the HRegionServer runs, especially how many bytes are received and transmitted per second. CPU, disk I/O, and memory information would be useful as well.

To see the statistics through the web UI, we first need to edit the .jsp.
Add the code below at the appropriate position in regionserver.jsp:

<h2>Region Server Monitor</h2>
<div>
<img src="/monitor/envmon.png" alt="Monitoring status" title="Network Status" />
</div>

The monitor generates a .png image and saves it somewhere, and the web server then serves it as /monitor/envmon.png. The next step is therefore to map that "somewhere" to the target path on the web server. HBase and its underlying HDFS use Jetty as their web server. Add some code to org.apache.hadoop.hbase.util.InfoServer.addDefaultApps:

    // add monitor dir;
    String monDir = "/tmp/envmon";
    Context monContext = new Context(parent, "/monitor");
    monContext.setResourceBase(monDir);
    monContext.addServlet(DefaultServlet.class, "/");
    defaultContexts.put(monContext, true);


Jan 22 2012
 

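Last year I also tried to write a year-end review, but it turned out too thin and I eventually took it down. This time, before the Lunar New Year arrives and while the year-end review at the lab group meeting is still fresh, I will finish my review of 2011 after all. "This year" in this post means 2011.

Lab Work

Since I already summarized this at the group meeting, I will only touch on it briefly here. The code I wrote most in 2011 was bash and Java. bash + expect saved a lot of time when setting up the server cluster at the beginning of the year. As for Java, it was entirely Hadoop + HBase.
On the HBase side, I made some progress on improving write performance and am still working on other parts. Read performance has already been improved upstream, leaving little room for further gains. The rest is omitted here.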

Paper Reading

The main ones:

  • CloudCache: Expanding and Shrinking Private Caches (HPCA’11)
  • CloneCloud: Elastic Execution Between Mobile Device and Cloud (Eurosys’11)
  • Tradeoffs between Profit and Customer Satisfaction for Service Provisioning in the Cloud (HPDC’11)
  • Mesos (NSDI’11)
  • A File is Not a File: Understanding the I/O Behavior of Apple Desktop Applications (SOSP’11)
  • Differentiated Storage Services (SOSP’11)
  • YCSB++: Benchmarking and Performance Debugging Advanced Features in Scalable Table Stores (SOCC’11)
  • Ricardo: Integrating R and Hadoop (SIGMOD’10)
  • Apache Hadoop Goes Realtime at Facebook (SIGMOD’11)
  • PARDA: Proportional Allocation of Resources for Distributed Storage Access (FAST’09)
  • HiTune: Dataflow-Based Performance Analysis for Big Data Cloud (ATC’11)

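Of these, I personally think Mesos is the best. A File is Not a File, although it is a SOSP’11 best paper and opens up a large area for follow-up work, still strikes me as taking an unconventional route. The other SOSP paper is from Intel and also first appeared in OSR; it is written in plain language, the guy who presented it at the conference spoke extremely fast, and its logistics analogy for I/O felt very familiar to me.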

Dec 19 2011
 

YCSB is not a benchmark built specifically for HBase; it targets multiple cloud storage systems, including Cassandra, HBase, JDBC, MongoDB, Redis, Voldemort, etc., to facilitate performance comparisons across the new generation of cloud data serving systems.

So far, YCSB (and YCSB++) is probably the only useful benchmark for evaluating the performance of HBase. However, it does have some issues.

  1. Its measurement is not accurate. I do not mean the 'precision' of the measurement. In a multi-threaded environment, when a large number of WRITE operations are sent during the default 10-second measurement period, some operations may not return in time (within the period), yet they are still counted. By default HTable.put() buffers WRITE requests and batches them into RPCs. This causes an inconsistency between the measured 'throughput' and 'latency', producing a kind of drift.
  2. Sequential is not that sequential. There are two types of WRITE operation in YCSB: INSERT and UPDATE. In fact, they share the same HTable.put() client interface, and the only difference is the key generator. By default, INSERT uses a sequential key generator, CounterGenerator, which generates a sequence of integers 0, 1, …, while UPDATE uses ScrambledZipfianGenerator, which draws keys from a zipfian distribution.
    Intuitively, you might want to replace zipfian with uniform in UPDATE/INSERT to simulate random writes to HBase, and use the default INSERT to simulate sequential writes to HBase. The former works, but the latter probably does not. The problem is that there is only one instance of the key generator, so with many threads the keys handed to each thread are not that sequential. The more threads, the more fragmentation, and you do not get the expected results. One possible solution is to rewrite the generator, or to let each thread hold its own key generator instance that produces its own range of keys.
  3. Too micro. Creating a macro benchmark for cloud storage like HBase is rather difficult, but what YCSB does is too micro. Although it can spawn a great many threads to simulate multiple client actions, it is not truly multi-client. In real cases, a considerable number of clients write and read their own key-values stored in cloud data services. The workload simulation in YCSB can only cover a single type, or a mixture of types, of operations for one client. Why? The single key generator instance is the cause, again. Fortunately, an extension of YCSB called YCSB++ has been open sourced, which really can use multiple clients running on different nodes, synchronized via a ZooKeeper cluster. However, we still need a macro benchmark for deep evaluation of HBase.
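
The per-thread generator idea from item 2 can be sketched in plain Java. This is a minimal illustration, not YCSB code: the class and method names (KeyRanges, SharedCounter, RangeCounter, gaps) are hypothetical, and the shared case assumes an idealised schedule in which the threads take strict turns on the counter.

```java
import java.util.concurrent.atomic.AtomicLong;

final class KeyRanges {
    /** Shared generator: every thread draws from the same counter. */
    static final class SharedCounter {
        private final AtomicLong next = new AtomicLong();
        long nextKey() { return next.getAndIncrement(); }
    }

    /** Per-thread generator: thread t owns keys [t * perThread, (t + 1) * perThread). */
    static final class RangeCounter {
        private long next;
        private final long end;
        RangeCounter(int threadId, long perThread) {
            this.next = (long) threadId * perThread;
            this.end = this.next + perThread;
        }
        long nextKey() {
            if (next >= end) throw new IllegalStateException("key range exhausted");
            return next++;
        }
    }

    /** Number of steps other than +1 in the keys one thread received. */
    static int gaps(long[] keys) {
        int gaps = 0;
        for (int i = 1; i < keys.length; i++) {
            if (keys[i] != keys[i - 1] + 1) gaps++;
        }
        return gaps;
    }

    /** Keys one thread sees if `threads` threads take strict turns on a shared counter (assumed schedule). */
    static long[] sharedKeysRoundRobin(int threadId, int threads, int perThread) {
        SharedCounter shared = new SharedCounter();
        long[] mine = new long[perThread];
        for (int round = 0; round < perThread; round++) {
            for (int t = 0; t < threads; t++) {
                long key = shared.nextKey();
                if (t == threadId) mine[round] = key;
            }
        }
        return mine;
    }

    /** Keys one thread sees when it owns a private, contiguous range. */
    static long[] rangedKeys(int threadId, int perThread) {
        RangeCounter gen = new RangeCounter(threadId, perThread);
        long[] mine = new long[perThread];
        for (int i = 0; i < perThread; i++) mine[i] = gen.nextKey();
        return mine;
    }

    public static void main(String[] args) {
        System.out.println("shared counter gaps: " + gaps(sharedKeysRoundRobin(1, 4, 5)));
        System.out.println("per-thread range gaps: " + gaps(rangedKeys(1, 5)));
    }
}
```

With a shared counter, every step a thread sees is a jump; with a private range, each thread writes a contiguous run of keys. Real thread scheduling is less regular than the strict turn-taking assumed here, so the actual fragmentation would vary from run to run.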

Evaluating cloud data storage services is a massive, iterative, boring job. Sometimes intuition does not work; sometimes you have to wait a long time to grab a useful log for later analysis. Analysis skills are essential; data visualisation is quite helpful.

Dec 18 2011
 

When re-engineering geotr the other day, I wanted to embed some compiler information in the resulting binary, like what vim prints:

VIM - Vi IMproved 7.3 (2010 Aug 15, compiled Dec 2 2011 13:23:14)
MacOS X (unix) version

Compilation: clang -c -I. -Iproto -DHAVE_CONFIG_H -DMACOS_X_UNIX -no-cpp-precomp -O3 -arch x86_64 -m64 -I/System/Library/Frameworks/Tcl.framework/Headers -D_REENTRANT=1 -D_THREAD_SAFE=1 -D_DARWIN_C_SOURCE=1 -I/System/Library/Frameworks/Ruby.framework/Versions/1.8/usr/lib/ruby/1.8/universal-darwin10.0 -DRUBY_VERSION=18
Linking: clang -L. -L/usr/local/lib -o vim -lm -lncurses -liconv -framework Cocoa -framework Python -F/System/Library/Frameworks -framework Tcl -framework CoreFoundation -lruby


Dec 09 2011
 

In HBase: The Definitive Guide, Lars George introduces a new HBase feature, Counter Increment, which treats a column as a counter and makes it easier to provide statistics for some applications.

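Traditionally, without counters, adding 1 or some other amount to the value of a column means reading the value from the column, modifying it on the client, and finally writing it back to the Region Server, i.e. a Read-Modify-Write (RMW) operation. In this process, according to Lars [1], the row being operated on also has to be locked beforehand and unlocked afterwards. This causes a lot of contention and the many problems that come with it. HBase's increment interface, by contrast, guarantees that a client request is completed atomically on the Region Server side.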
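
The difference between the two approaches can be illustrated without HBase at all. The sketch below is a plain-Java stand-in under stated assumptions: Cell is a hypothetical class that plays the role of one counter column, get/put model the client-side read and write, and increment models an update applied atomically on the server. None of these names are HBase APIs.

```java
import java.util.concurrent.atomic.AtomicLong;

final class CounterDemo {
    /** Stand-in for one counter cell held by the server; not an HBase class. */
    static final class Cell {
        private final AtomicLong value = new AtomicLong();
        long get() { return value.get(); }                               // "read"
        void put(long v) { value.set(v); }                               // "write"
        long increment(long amount) { return value.addAndGet(amount); }  // atomic update
    }

    /** Client-side read-modify-write without a row lock: concurrent updates can be lost. */
    static void readModifyWrite(Cell cell) {
        long current = cell.get();
        cell.put(current + 1);
    }

    /** Runs `threads` workers, each adding 1 `opsPerThread` times, and returns the final value. */
    static long run(int threads, int opsPerThread, boolean atomic) {
        Cell cell = new Cell();
        Thread[] workers = new Thread[threads];
        for (int t = 0; t < threads; t++) {
            workers[t] = new Thread(() -> {
                for (int i = 0; i < opsPerThread; i++) {
                    if (atomic) cell.increment(1); else readModifyWrite(cell);
                }
            });
            workers[t].start();
        }
        for (Thread w : workers) {
            try {
                w.join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException("interrupted while waiting for workers", e);
            }
        }
        return cell.get();
    }

    public static void main(String[] args) {
        System.out.println("atomic increment: " + run(8, 10_000, true));
        System.out.println("unlocked RMW:     " + run(8, 10_000, false));
    }
}
```

The atomic variant always ends at threads × opsPerThread. The unlocked RMW variant can end lower because two clients may read the same value and both write back value + 1; avoiding that is what the row lock costs in the RMW approach.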

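As for how increment performs, we can only find out by testing. YCSB already provides a Read-Modify-Write test interface, whereas the increment interface has to be implemented ourselves [2].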

  1. HBase: The Definitive Guide, p168. O’Reilly
  2. Patch 1

Dec 02 2011
 

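I installed Lua under /opt, so when building vim the Lua interpreter is not enabled by default, even with --enable-luainterp=yes. The Lua install location has to be given explicitly with --with-lua-prefix=/opt. Even then, make fails at the end: ld cannot find liblua5.1.so, or on Mac OS X, liblua.dylib.
Checking /opt/lib, there is indeed no shared library, only liblua.a. Looking through Lua's Makefile confirms that it never builds a shared library. I found this article, which shows that the Makefile has to be completed by hand. On OS X the .so suffix has to be changed to .dylib [1]. With that, the build succeeds on OS X and vim can link against the Lua library.
On Linux, however, I ran into another problem. Building liblua.so fails with: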

/usr/bin/ld: lapi.o: relocation R_X86_64_32 against `luaO_nilobject_' can not be used when making a shared object; recompile with -fPIC
lapi.o: could not read symbols: Bad value
collect2: ld returned 1 exit status
clang: error: linker command failed with exit code 1 (use -v to see invocation)

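It seems the previously compiled lapi.o cannot be linked into a shared library, and ld suggests using -fPIC. Following this article, adding -fPIC to CFLAGS is enough, which completes the Lua build. In addition, vim links with -L/opt -llua5.1, so add a symlink: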

cd /opt/lib
sudo ln -s liblua.so liblua5.1.so

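Once vim is built, /opt/lib still has to be added to the dynamic loader's search path (the same effect as LD_LIBRARY_PATH, done here system-wide):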

echo "/opt/lib" | sudo tee -a /etc/ld.so.conf.d/adhoc.conf
sudo ldconfig

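Otherwise vim fails at startup with an error that liblua.so cannot be found, unless it was configured with --enable-luainterp=dynamic.

Finally, for the record, the vim configure options: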

./configure --with-features=huge \
--enable-cscope --enable-rubyinterp --enable-python3interp=yes --enable-pythoninterp=yes --enable-perlinterp=yes \
--enable-multibyte --enable-fontset --disable-gui --disable-netbeans \
--enable-luainterp=yes --with-lua-prefix=/opt --enable-largefile --enable-tclinterp \
--with-compiledby=marcus@zyxar.com --prefix=/opt --bindir=/opt/bin \
CC=clang CFLAGS="-O3 -m64"

  1. Also, clang accepts either -shared or -dynamiclib; gcc uses -dynamiclib.

Nov 15 2011
 

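These notes are based on Chapter 15, Queueing Theory, of Foundations of Computer Systems Research. I added derivations for some of the book's formulas and corrected some errors in its formulas or statements.

Probability Distributions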

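  • Geometric Distribution
    Let the variable \(X\) denote the number of independent trials performed up to and including the first occurrence of some event. If
    \[
    P(X=n) = (1-p)^{n-1}p, \quad n = 1, 2, 3, \cdots
    \]
    then \(X\) follows a geometric distribution with parameter \(p\). For the geometric distribution,
    \[
    \mathrm{E}(X) = \frac{1}{p}, \quad \sigma^2(X) = \frac{1-p}{p^2}
    \]
    Beyond that, the most important property of the geometric distribution is that it is memoryless, i.e.
    \[
    P(X>M+N|X>M) = \frac{1-P(X \leqslant M+N)}{1 - P(X \leqslant M)} = \frac{(1-p)^{M+N}}{(1-p)^{M}} = (1-p)^{N}
    = P(X>N)
    \]
    The geometric distribution is also the only memoryless discrete distribution. It closely resembles the exponential distribution discussed below.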
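    For reference, both the tail probability used in the memoryless property and the mean follow directly from the definition \(P(X=n) = (1-p)^{n-1}p\). A short derivation, writing \(q = 1-p\) as shorthand (my notation, not necessarily the book's) and assuming \(0 < p \leqslant 1\):
    \[
    P(X>n) = \sum_{k=n+1}^{\infty} q^{k-1}p = p \cdot \frac{q^{n}}{1-q} = q^{n} = (1-p)^{n}
    \]
    \[
    \mathrm{E}(X) = \sum_{n=1}^{\infty} n q^{n-1} p = p \frac{\mathrm{d}}{\mathrm{d}q} \left( \sum_{n=0}^{\infty} q^{n} \right) = p \frac{\mathrm{d}}{\mathrm{d}q} \left( \frac{1}{1-q} \right) = \frac{p}{(1-q)^{2}} = \frac{1}{p}
    \]
    As a sanity check, a fair coin (\(p = 1/2\)) needs \(1/p = 2\) tosses on average to show the first head.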

Oct 22 2011
 

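The previous post described how HDFS maintains a block-to-volume mapping, and showed that HDFS uses volumes in a round-robin fashion so that they are used evenly. In some scenarios, however, we may need a particular block to be created in a particular volume. We do not need to know, or guarantee, which Datanode the block is created on, but being able to specify the volume lets us keep certain blocks from being created in the same volume.
To provide this, we have to dig our own tunnel in HDFS that leads straight to the local file system.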
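
To make the goal concrete, here is a simplified model of the two behaviours: round-robin selection, and a hypothetical "pinned" selection where the caller names the volume. This is only a sketch and not HDFS code; VolumeChooser, next, pinned, and the /data1 to /data3 paths are made up for illustration, and the sketch ignores details such as free-space checks that a real implementation would need.

```java
import java.util.Arrays;
import java.util.List;

final class VolumeChooser {
    private final List<String> volumes;
    private int cursor = 0; // index of the volume to hand out next

    VolumeChooser(List<String> volumes) {
        if (volumes.isEmpty()) throw new IllegalArgumentException("no volumes");
        this.volumes = volumes;
    }

    /** Round-robin choice: each call returns the next volume in turn. */
    synchronized String next() {
        String chosen = volumes.get(cursor);
        cursor = (cursor + 1) % volumes.size();
        return chosen;
    }

    /** Hypothetical pinned choice: the caller names a volume index instead of taking the next one. */
    synchronized String pinned(int volumeIndex) {
        return volumes.get(volumeIndex);
    }

    public static void main(String[] args) {
        VolumeChooser chooser = new VolumeChooser(Arrays.asList("/data1", "/data2", "/data3"));
        for (int block = 0; block < 4; block++) {
            System.out.println("block " + block + " -> " + chooser.next());
        }
        System.out.println("pinned block -> " + chooser.pinned(2));
    }
}
```

The hard part is not the chooser itself but carrying the caller's volume hint from the client API all the way down to the code that picks the volume on the Datanode.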
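First, let us follow the HDFS code to get a fairly complete picture of how a file (assume a SequenceFile here) is created in HDFS.
org.apache.hadoop.io.SequenceFile creates a Writer instance through the static method createWriter(); Writer is the inner class that writes data into a SequenceFile. Depending on the chosen compression type, this static method calls the constructor of the corresponding Writer. For example: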

    /** Create the named file with write-progress reporter. */
    public Writer(FileSystem fs, Configuration conf, Path name,
                  Class keyClass, Class valClass,
                  int bufferSize, short replication, long blockSize,
                  Progressable progress, Metadata metadata)
      throws IOException {
      init(name, conf,
           fs.create(name, true, bufferSize, replication, blockSize, progress),
              keyClass, valClass, false, null, metadata);
      initializeFileHeader();
      writeFileHeader();
      finalizeFileHeader();
    }

Here fs.create(name, true, bufferSize, replication, blockSize, progress) returns an FSDataOutputStream. The call ends up in the abstract create() method declared in FileSystem.java, which in the HDFS case is implemented by DistributedFileSystem.java:

  public FSDataOutputStream create(Path f, FsPermission permission,
    boolean overwrite,
    int bufferSize, short replication, long blockSize,
    Progressable progress) throws IOException {

    return new FSDataOutputStream
       (dfs.create(getPathName(f), permission,
                   overwrite, replication, blockSize, progress, bufferSize),
        statistics);
  }


Oct 21 2011
 

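HDFS looks up the dfs.data.dir property in its configuration file to find where DFS data is stored in the local file system. If the server has several disks (assume all are mounted into the local file system), we want HDFS to use them as evenly and fully as possible. In theory HDFS is indeed up to the job. In HDFS, such a local directory that stores data is called a volume.
Going straight to the code in DataNode.java: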

    public static DataNode createDataNode(String args[],  Configuration conf) throws IOException {
        DataNode dn = instantiateDataNode(args, conf);
        runDatanodeDaemon(dn);
        return dn;
    }

    public static DataNode instantiateDataNode(String args[], Configuration conf) throws IOException {
        //...
       String[] dataDirs = conf.getStrings("dfs.data.dir");
       dnThreadName = "DataNode: [" +
                            StringUtils.arrayToString(dataDirs) + "]";
       return makeInstance(dataDirs, conf);
    }

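Before the actual instantiation, the code first fetches the strings configured under dfs.data.dir as dataDirs. The makeInstance(dataDirs, conf) method then checks whether each entry of dataDirs exists and is usable in the local file system. As long as one DIR is usable, a new DataNode is created.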
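
The "at least one usable directory" rule can be sketched with java.io.File alone. This is an illustration, not the Hadoop implementation: DataDirCheck and its methods are hypothetical names, "usable" is assumed to mean an existing directory that is readable and writable, and throwing when nothing is usable is my assumption about the failure case.

```java
import java.io.File;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

final class DataDirCheck {
    /** Keeps only the entries that are existing, readable and writable directories. */
    static List<File> usableDirs(String[] dataDirs) {
        List<File> usable = new ArrayList<>();
        for (String dir : dataDirs) {
            File f = new File(dir);
            if (f.isDirectory() && f.canRead() && f.canWrite()) {
                usable.add(f);
            }
        }
        return usable;
    }

    /** One usable directory is enough to proceed; none at all is treated as an error (assumption). */
    static List<File> requireAtLeastOne(String[] dataDirs) throws IOException {
        List<File> usable = usableDirs(dataDirs);
        if (usable.isEmpty()) {
            throw new IOException("no usable directory among the configured data dirs");
        }
        return usable;
    }

    public static void main(String[] args) throws IOException {
        String tmp = System.getProperty("java.io.tmpdir");
        String[] dirs = { tmp, "/hypothetical/missing/dir" };
        System.out.println("usable: " + requireAtLeastOne(dirs));
    }
}
```

A consequence of this rule is that a node with one failed disk still starts, just with fewer volumes, so a missing volume is easy to overlook unless the logs are checked.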
The DataNode() constructor calls startDataNode(conf, dataDirs) directly. The data-related part of that method is:

    startDataNode(){
        //…
        storage = new DataStorage();
        //…
        // read storage info, lock data dirs and transition fs state if necessary
        storage.recoverTransitionRead(nsInfo, dataDirs, startOpt);
        // adjust
        this.dnRegistration.setStorageInfo(storage);
        // initialize data node internal structure
        this.data = new FSDataset(storage, conf);
    }

storage.recoverTransitionRead(nsInfo, dataDirs, startOpt) performs further checks on dataDirs:
