-H flag in the examples below signifies that the file has a header row which is used for naming columns.
-t flag is just a shortcut for saying that the file is a tab-separated file (any delimiter is supported - Use the
Queries are given using upper case for clarity, but actual query keywords such as SELECT and WHERE are not really case sensitive.
- Example 1 - COUNT DISTINCT values of specific field (uuid of clicks data)
- Example 2 - Filter numeric data, controlling ORDERing and LIMITing output
- Example 3 - Illustrate GROUP BY
- Example 4 - More complex GROUP BY (group by time expression)
- Example 5 - Read input from standard input
- Example 6 - Use column names from header row
- Example 7 - JOIN two files
q -H -t "SELECT COUNT(DISTINCT(uuid)) FROM ./clicks.csv"
q understands that the column is numeric and filters according to its numeric value (real numeric value comparison, not string comparison).
q -H -t "SELECT request_id,score FROM ./clicks.csv WHERE score > 0.7 ORDER BY score DESC LIMIT 5"
2cfab5ceca922a1a2179dc4687a3b26e 1.0 f6de737b5aa2c46a3db3208413a54d64 0.986665809568 766025d25479b95a224bd614141feee5 0.977105183282 2c09058a1b82c6dbcf9dc463e73eddd2 0.703255121794
q -t -H "SELECT hashed_source_machine,count(*) FROM ./clicks.csv GROUP BY hashed_source_machine"
q -t -H "SELECT strftime('%H:%M',date_time) hour_and_minute,count(*) FROM ./clicks.csv GROUP BY hour_and_minute"
07:00 138148 07:01 140026 07:02 121826
Calculates the total size per user/group in the /tmp subtree.
sudo find /tmp -ls | q "SELECT c5,c6,sum(c7)/1024.0/1024 AS total FROM - GROUP BY c5,c6 ORDER BY total desc"
mapred hadoop 304.00390625 root root 8.0431451797485 smith smith 4.34389972687
Calculate the top 3 user ids with the largest number of owned processes, sorted in descending order.
Note the usage of the autodetected column name UID in the query.
ps -ef | q -H "SELECT UID,COUNT(*) cnt FROM - GROUP BY UID ORDER BY cnt DESC LIMIT 3"
root 152 harel 119 avahi 2
The following command joins an ls output (exampledatafile) and a file containing rows of group-name,email (group-emails-example) and provides a row of filename,email for each of the emails of the group. For brevity of output, there is also a filter for a specific filename called ppp which is achieved using a WHERE clause.
q "SELECT myfiles.c8,emails.c2 FROM exampledatafile myfiles JOIN group-emails-example emails ON (myfiles.c4 = emails.c1) WHERE myfiles.c8 = 'ppp'"
ppp firstname.lastname@example.org ppp email@example.com
You can see that the
ppp filename appears twice, each time matched to one of the emails of the group dip to which it belongs. Take a look at the files
group-emails-example for the data.
Column name detection is supported for JOIN scenarios as well. Just specify
-H in the command line and make sure that the source files contain the header rows.