View on GitHub

q

q - Text as Data

Download the last stable version zip file Download the last stable version tar.gz file Download the latest stable version RPM Download the latest stable version Debian package Download the single-file executable Download the single-file executable

General

Documentation

Behind the Scenes

Author

Social

Examples

The -H flag in the examples below signifies that the file has a header row which is used for naming columns.

The -t flag is just a shortcut for saying that the file is a tab-separated file (any delimiter is supported - Use the -d flag).

Queries are given using upper case for clarity, but actual query keywords such as SELECT and WHERE are not really case sensitive.



Example 1 - COUNT DISTINCT values of specific field (uuid of clicks data)

q -H -t "SELECT COUNT(DISTINCT(uuid)) FROM ./clicks.csv"
      

Output 1:

229
      


Example 2 - Filter numeric data, controlling ORDERing and LIMITing output

Note that q understands that the column is numeric and filters according to its numeric value (real numeric value comparison, not string comparison).

q -H -t "SELECT request_id,score FROM ./clicks.csv WHERE score > 0.7 ORDER BY score DESC LIMIT 5"
      

Output 2:

2cfab5ceca922a1a2179dc4687a3b26e    1.0
        f6de737b5aa2c46a3db3208413a54d64    0.986665809568
        766025d25479b95a224bd614141feee5    0.977105183282
        2c09058a1b82c6dbcf9dc463e73eddd2    0.703255121794
      


Example 3 - Illustrate GROUP BY

q -t -H "SELECT hashed_source_machine,count(*) FROM ./clicks.csv GROUP BY hashed_source_machine"
      

Output 3:

47d9087db433b9ba.domain.com 400000
      


Example 4 - More complex GROUP BY (group by time expression)

q -t -H "SELECT strftime('%H:%M',date_time) hour_and_minute,count(*) FROM ./clicks.csv GROUP BY hour_and_minute"
      

Output 4:

07:00   138148
        07:01   140026
        07:02   121826
      


Example 5 - Read input from standard input

Calculates the total size per user/group in the /tmp subtree.

sudo find /tmp -ls | q "SELECT c5,c6,sum(c7)/1024.0/1024 AS total FROM - GROUP BY c5,c6 ORDER BY total desc"
      

Output 5:

mapred hadoop   304.00390625
        root   root     8.0431451797485
        smith  smith    4.34389972687
      


Example 6 - Use column names from header row

Calculate the top 3 user ids with the largest number of owned processes, sorted in descending order.

Note the usage of the autodetected column name UID in the query.

ps -ef | q -H "SELECT UID,COUNT(*) cnt FROM - GROUP BY UID ORDER BY cnt DESC LIMIT 3"
      

Output 6:

root 152
        harel 119
        avahi 2
      


Example 7 - JOIN two files

The following command joins an ls output (exampledatafile) and a file containing rows of group-name,email (group-emails-example) and provides a row of filename,email for each of the emails of the group. For brevity of output, there is also a filter for a specific filename called ppp which is achieved using a WHERE clause.

q "SELECT myfiles.c8,emails.c2 FROM exampledatafile myfiles JOIN group-emails-example emails ON (myfiles.c4 = emails.c1) WHERE myfiles.c8 = 'ppp'"
      

Output 7:

ppp dip.1@otherdomain.com
        ppp dip.2@otherdomain.com
      

You can see that the ppp filename appears twice, each time matched to one of the emails of the group dip to which it belongs. Take a look at the files exampledatafile and group-emails-example for the data.

Column name detection is supported for JOIN scenarios as well. Just specify -H in the command line and make sure that the source files contain the header rows.