sql - Poor performance on Amazon Redshift queries based on VARCHAR size
I'm building an Amazon Redshift data warehouse, and I'm experiencing unexpected performance impacts based on the defined size of a VARCHAR column. Details follow. Three of the columns are shown from pg_table_def:
 schemaname | tablename |    column     |            type             | encoding | distkey | sortkey | notnull
------------+-----------+---------------+-----------------------------+----------+---------+---------+---------
 public     | logs      | log_timestamp | timestamp without time zone | delta32k | f       |       1 | t
 public     | logs      | event         | character varying(256)      | lzo      | f       |       0 | f
 public     | logs      | message       | character varying(65535)    | lzo      | f       |       0 | f
I've run VACUUM and ANALYZE, I have about 100 million rows in the database, and I'm seeing very different performance depending on which columns I include.
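For reference, a minimal sketch of that maintenance step (standard Redshift commands; logs is the table from the question):

-- Reclaim space and re-sort rows, then refresh the planner's statistics.
vacuum logs;
analyze logs;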
Query 1: For instance, the following query takes 3 seconds:
select log_timestamp from logs order by log_timestamp desc limit 5;
Query 2: A similar query asking for more data runs in 8 seconds:
select log_timestamp, event from logs order by log_timestamp desc limit 5;
Query 3: However, this query, similar to the previous ones, takes 8 minutes to run!
select log_timestamp, message from logs order by log_timestamp desc limit 5;
Query 4: Finally, this query, identical to the slow one but with explicit range limits, is very fast (~3s):
select log_timestamp, message from logs where log_timestamp > '2014-06-18' order by log_timestamp desc limit 5;
The message column is defined to be able to hold larger messages, but in practice it doesn't hold much data: the average length of the message field is 16 characters (std_dev 10), and the average length of the event field is 5 characters (std_dev 2). The only distinction I can see is the max length of the VARCHAR field, but I wouldn't think that should have an order-of-magnitude impact on the time a simple query takes to return!
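For reference, these length statistics can be computed with something like the following (a sketch; LEN and STDDEV are built-in Redshift functions, and the column names come from the table above):

-- Average and standard deviation of the stored string lengths.
select avg(len(message)) as avg_message_len,
       stddev(len(message)) as stddev_message_len,
       avg(len(event)) as avg_event_len,
       stddev(len(event)) as stddev_event_len
from logs;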
Any insight would be appreciated. While this isn't the typical use case for this tool (we'll be aggregating far more than we'll be inspecting individual logs), I'd like to understand the subtle or not-so-subtle effects of our table design.
Thanks!
Dave
redshift "true columnar" database , reads columns specified in query. so, when specify 2 little columns, 2 columns have read @ all. when add together in 3rd big column work redshift has dramatically increases.
This is very different from a "row store" database (SQL Server, MySQL, Postgres, etc.) where the entire row is stored together. In a row store, adding/removing query columns does not make much difference in response time, because the database has to read the whole row anyway.
Finally, the reason your last query is fast is that you've told Redshift it can skip a large portion of the data. Redshift stores each column in 1 MB "blocks", and these blocks are sorted according to the sort key you specified. Redshift keeps a record of the min/max of each block and can skip over any blocks that could not contain data to be returned.
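Those per-block min/max values (the "zone maps") are also visible in stv_blocklist. A sketch, where col = 0 is assumed to be log_timestamp, the first column of the table (for timestamps the min/max are stored as numeric encodings):

-- Inspect the per-block min/max that Redshift uses to skip blocks.
select blocknum, minvalue, maxvalue
from stv_blocklist
where tbl = (select table_id from svv_table_info where "table" = 'logs')
  and col = 0
order by blocknum
limit 10;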
The LIMIT clause doesn't reduce the work that has to be done, because you've told Redshift it must first order all of log_timestamp descending. The problem is that ORDER BY … DESC has to be executed over the entire potential result set before any data can be returned or discarded. When the columns are small that's fast; when they're big, it's slow.
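One workaround consistent with this explanation (not part of the original answer, just a sketch) is to sort and limit on the narrow sort-key column first, then fetch the wide message column only for those rows:

-- Find the 5 most recent timestamps cheaply, then join back for the wide column.
select l.log_timestamp, l.message
from logs l
join (
    select log_timestamp
    from logs
    order by log_timestamp desc
    limit 5
) recent on l.log_timestamp = recent.log_timestamp
order by l.log_timestamp desc;

Note that if log_timestamp is not unique this can return more than 5 rows, so a final LIMIT 5 (or joining on a unique key) may still be needed.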
sql amazon-redshift