Curate Like a Boss pt.2 - A Beginner's Guide to Querying SteemSQL (by a complete SQL beginner)

curate like a boss2.jpg

TOPICS: Curie Curation Query v1.1; example queries for art and music curation; include/exclude multiple tags (not just 1st tag/category); keyword search in post body; how to customize results columns.

EDIT READ ME

There seems to be a server side issue with the SteemSQL database that is preventing the queries in this post from working as intended RE filtering for post age. I have isolated this issue to the SteemSQL database by stripping out everything from a query except the datediff statement and SteemSQL is consistently returning posts 2 hours younger than the datediff statement specifies as the youngest age. Example query here that is returning posts aged 30 minutes when it should return posts aged minimum 150 minutes:

SELECT
created, 'https://steemit.com'+ url, title, body
FROM
    Comments (NOLOCK)
WHERE
    parent_author = '' and
    datediff(minute, created, GETDATE()) between 150 and 200
ORDER BY
    created desc

I have submitted a note to this effect in the SteemSQL help channel on steemit.chat. I will keep this post updated and remove this notice if and when the issue is resolved. For now, you can just scroll down through the result set until you reached posts aged as old as you want to start your curation from, or adjust the datediff statement to take the 2 hour difference into account. Sorry for any hassle! Cheers - Carl

/EDIT






Okay folks this is week two of my SQL deep dive. I am coming up for air briefly to share with you what I have learned. My primary target audience here remains Steemit curators of all stripes, but anyone interested in learning how to query the Steemit blockchain for data analysis may find something useful here. I do not intend this series as a full SQL tutorial; there are plenty of great SQL tutorials online that you can reference for detailed explanations of all SQL operators and syntax. My goal is to explain just enough that by providing example queries for you to start with, you can begin to customize portions of them and query SteemSQL on your own.

What the heck are you talking about? Where do I begin?


Check out part one for the basics, covering what is SteemSQL, how to connect to the SteemSQL database and how to execute a SQL query. Don't be scared, SQL won't bite - no matter how scary some of the example queries below may look ;)

To the Curie Curators


This is also week two on the job for me as a @curie curator - the entire impetus behind me teaching myself SQL and figuring out how to query the SteemSQL database was curating for Curie more efficiently. I am a stay-at-home dad with two sons under the age of three; unmolested time on the computer is a luxury and I realized I needed a way to filter new posts by the Curie guidelines if I was going to have any chance of succeeding as a Curie curator. I have found SQL query to be immensely helpful to me personally in this regard and I hope it helps you as well. I encourage you to DM me on steemit.chat (same username) or Discord chat (@gnashster) if you have comments or suggestions on my Curie query. May the reviewers be always in your favor :)

To Curators of All Stripes


If you are curating a particular corner of Steemit that I am not covering here in this post, I would absolutely love to hear from you in the comments. What kind of filtering ability would help you while searching for new posts in your particular area of specialty? What tags are you looking in? What keywords would be helpful? I will be happy to make a query for you and include it (with a shoutout to you) in a future post in this series. I play well with others - give it a try, I am not going to say that I won't bite, but I promise I will only bite if you ask me to and it is 100% clear that this is a consensual activity :)

Curate Like a Boss part one ERRATA


I left off an important step in the first part of this tutorial series. After pasting in a query to LINQPad and running it (or customizing it), you should save the query for future use so you do not have to paste it in again or customize it again. I have edited the first part of this series to include this step, but I am including it here as well for reference for those who have already read the first post previously:

Before you close out of LINQPad, save the query you executed so you can use it in the future without having to paste it in again. File / Save (or Save as), Ctrl-S or right-click on query name (toward top right of window) and save (or Save as). Once you have saved a query, it will appear under "My queries" toward the bottom left of the LINQPad window (circled in pink below). Simply click on a saved query to load it.


Curie Query v1.1


What's new?

  • Including/Excluding Tags
  • Excluding Authors
  • Keyword Search in Post Body
  • Standalone past-week author pending post payout query

By default, Curie Query v1.1 will return all new posts aged 150 minutes to 24 hours, by authors with REP 27-52, with pending post payout < $1.00, and post character count > 4000 (this includes html characters). Categories likely to yield foreign language posts, memes, religious posts and Steemit related posts have been excluded, along with introduceyourself posts.

NOTE: If you add a bunch of excluded tags and/or authors to your query, you might start experiencing errors when LINQPad attempts to execute your query. It may serve up a "syntax" error. If this occurs, just run the query again and it should work (eventually LOL). If you do experience repeated syntax errors, check the SQL query where the error specifies to make sure you didn't mess something up, like forgetting single quotes around an author name, accidentally deleting the "or" or "and" statement at the end of a line, or accidentally including an "or" or "and" statement at the end of the last line. If every thing looks okay it just might be having a hard time with the query.

SELECT
    'https://steemit.com'+ url, body,
    pending_payout_value, created,
    LEN(body) as CharacterCount,
    CONVERT(int,(SELECT MAX(v) FROM (VALUES(log10(ABS(CONVERT(bigint,author_reputation)-1)) - 9),(0)) T(v)) * SIGN(author_reputation) * 9 + 25) as rep
FROM 
    Comments (NOLOCK)
WHERE
    (
    NOT json_metadata LIKE('%"introduceyourself"%') and
    NOT json_metadata LIKE('%"deutsch"%') and
    NOT json_metadata LIKE('%"miprimerconcurso"%') and
    NOT json_metadata LIKE('%"cn"%') and
    NOT json_metadata LIKE('%"ru"%') and
    NOT json_metadata LIKE('%"spanish"%') and
    NOT json_metadata LIKE('%"polish"%') and
    NOT json_metadata LIKE('%"myanmar"%') and
    NOT json_metadata LIKE('%"faith"%') and
    NOT json_metadata LIKE('%"christian-trail"%') and
    NOT json_metadata LIKE('%"meme"%') and
    NOT json_metadata LIKE('%"christianity"%') and
    NOT json_metadata LIKE('%"religion"%') and
    NOT json_metadata LIKE('%"god"%') and
    NOT json_metadata LIKE('%"bible"%')
) and 
   dirty = 'False' and
   parent_author = '' and
   NOT author IN ('excludedauthor1', 'excludedauthor2', 'excludedauthoretc') and
   datediff(minute, created, GETDATE()) between 150 and 24*60 and
   CONVERT(int,(SELECT MAX(v) FROM (VALUES(log10(ABS(CONVERT(bigint,author_reputation)-1)) - 9),(0)) T(v)) * SIGN(author_reputation) * 9 +25) between 27 and 52 and
   pending_payout_value < 1.0000 and
   LEN(body) > 4000
ORDER BY
    created desc

Customizing Curie Query v1.1


Only the new portions of the query are covered here; refer to tutorial part 1 for customizing the rest of the query.

Excluding Tags from the Query

WHERE
    (
    NOT json_metadata LIKE('%"introduceyourself"%') and
    NOT json_metadata LIKE('%"deutsch"%') and
    NOT json_metadata LIKE('%"miprimerconcurso"%') and
    NOT json_metadata LIKE('%"cn"%') and
    NOT json_metadata LIKE('%"ru"%') and
    NOT json_metadata LIKE('%"spanish"%') and
    NOT json_metadata LIKE('%"polish"%') and
    NOT json_metadata LIKE('%"myanmar"%') and
    NOT json_metadata LIKE('%"faith"%') and
    NOT json_metadata LIKE('%"christian-trail"%') and
    NOT json_metadata LIKE('%"meme"%') and
    NOT json_metadata LIKE('%"christianity"%') and
    NOT json_metadata LIKE('%"religion"%') and
    NOT json_metadata LIKE('%"god"%') and
    NOT json_metadata LIKE('%"bible"%')
) and 

You will notice right off the bat that this version of the query has a long ugly series of AND statements nested inside parentheses at the beginning of the WHERE statement. Unfortunately this is needed to exclude any tag, rather than just excluding category/first tag. Tags 2-5 of a post are only contained in the "json_metadata" column, which contains all 5 tags as well as some other random miscellanies such as the URLS of any links your post contains and even some images for reasons I do not fully understand yet. This causes some problems:

  • Because each tag is not in its own column, we cannot use the IN operator to neatly add all the tags we want to exclude at once (see "Excluding Authors" below for an example of the IN operator in action).
  • The "json_metadata" column is not full text searchable, unlike most columns in the SteemSQL database, so we cannot use the various FULLTEXT operators to exclude words (see "Add or Remove Keywords" under "Example Art and Music Curation Queries" below for an example of the FREETEXT FULLTEXT operator in action).
  • The solution I have arrived at is to use the LIKE operator in combination with wildcard characters, stringing together multiple NOT / LIKE statements with AND statements. The parentheses around this section are not strictly necessary but I think it helps to make it more obvious that this is a discrete section of the query. This is also an example of how SQL query can be kind of clunky if a column is not indexed for fulltext search. If you want to exclude more tags, simply copy paste the following line into the beginning of the list of NOT statements inside the parentheses, swapping out "exampletag" for your real excluded tag(s).
NOT json_metadata LIKE('%"exampletag"%') and

Of course if you decide you want to include some of the tags I have excluded by default in the query, simply remove the offending lines. Just make sure that every line has an "and" after it except the very last NOT / LIKE statement inside the parentheses.

A few more notes:

  • LIKE tells SQL to look for characters within the string (all the characters contained in a given row of a given column are known collectively as a "string" in SQL parlance) that match what you specify. The wildcard character "%" stands for any number of characters. The tags contained in the json_metadata are couched in double quotes (see screencap below for json_metadata from a post containing the tags "art", "drawing" and "creativity"):
  • The reason I include the double quotes around each tag in my query is to prevent false positive matches. As an example consider if you did not include the double quotes and added this line to exclude all posts with the tag "art" (not sure why you would ever want to do this LOL):
NOT json_metadata LIKE('%art%') and

The use of the two wildcard characters "%" on either side of "art" means that this would not only exclude the "art" tag, but also "originalart", "artexplosion", "fart", "smart", etc. - it would exclude any tag that contained the sequential letters "art" no matter where they occurred in the tag, because the "%" wildcard characters stand for any number of characters before or after the "art". By including the double quotes from the json_metadata around our tag, we will avoid this.

Excluding Authors


I have a blacklist of authors I have excluded from the version of this query that I run myself. I decided against sharing my own blacklist here as I don't want to ruffle any feathers. If you find yourself repeatedly coming across the same authors creating posts which make it past the >4000 character filter but would never yield a Curie submission, add the author names into the query where it says "excludedauthor1", "excludedauthor2", etc.

NOT author IN ('excludedauthor1', 'excludedauthor2', 'excludedauthoretc') and

A few things to note here:

  • NOT does what it says on the tin; whatever follows the NOT operator is excluded from the query. In this case we specify the "author" column and use the IN operator (which tells SQL to expect a comma separated chain of statements inside parentheses to follow) to provide a list of specific author names to exclude.
  • Make sure you do not include the "@" symbol in the author name
  • Make sure you enclose each author name in single quotes ('exampleauthorname')
  • Make sure you have a comma after every author name except the last author name

Only returning posts with specified tags / keywords


By default, Curie Query returns posts from all tags except tags that are specifically excluded. If you would rather curate from a list of tags that you believe are likely to yield submit-able posts, or possibly refine your search farther with keywords, here is an example query. This is NOT ready to use as is - you will have to customize the query to provide at least one actual Steemit tag in order to return any results. See below under Example Art Curation Query for explanations of how to customize the specified tags / keywords. Remove the FREETEXT line completely if you do not want to only return posts with certain keywords. I have also included the lines to easily exclude specified tags here as well (@misterakpan you can exclude #colorchallenge here LOL):

SELECT
    'https://steemit.com'+ url, body,
    pending_payout_value, created,
    LEN(body) as CharacterCount,
    CONVERT(int,(SELECT MAX(v) FROM (VALUES(log10(ABS(CONVERT(bigint,author_reputation)-1)) - 9),(0)) T(v)) * SIGN(author_reputation) * 9 + 25) as rep
FROM 
    Comments (NOLOCK)
WHERE
    (
    json_metadata LIKE('%"exampletag1"%') or
    json_metadata LIKE('%"exampletag2"%') or
    json_metadata LIKE('%"exampletag3"%') or
    json_metadata LIKE('%"exampletagetc"%')
    ) and 
 (
    NOT json_metadata LIKE('%"excludedtag1"%') and
    NOT json_metadata LIKE('%"excludedtag2"%') and
    NOT json_metadata LIKE('%"excludedtag3"%') and
    NOT json_metadata LIKE('%"excludedtagetc"%') 
) and 
   dirty = 'False' and
   parent_author = '' and
   NOT author IN ('excludedauthor1', 'excludedauthor2', 'excludedauthoretc') and
   FREETEXT(Body, 'keyword1 keyword2 keyword etc') and
   datediff(minute, created, GETDATE()) between 150 and 24*60 and
   CONVERT(int,(SELECT MAX(v) FROM (VALUES(log10(ABS(CONVERT(bigint,author_reputation)-1)) - 9),(0)) T(v)) * SIGN(author_reputation) * 9 +25) between 27 and 52 and
   pending_payout_value < 1.0000 and
   LEN(body) > 4000
ORDER BY
    created desc

Standalone past-week author pending post payout query

SELECT
    pending_payout_value
FROM
    Comments
WHERE
    author IN ('authorname') and
    parent_author = '' and
    datediff(day, created, GETDATE()) between 0 and 7
ORDER BY
    pending_payout_value desc

Keep this query open in LINQPad while you are using the main query to search for posts. If you find a promising post, paste the post author's name into this query where it says "authorname" (do NOT include the "@" symbol). This query returns the past week's pending post payouts for the specified author, ordered from greatest to least. If the top item in the result set is <$25, the author qualifies under the guideline that no single post from the author can have >$25 pending post payout. See screencap for an example using my own username - my highest past week pending post payout is only $16.2660 so I would qualify under this Curie guideline (although my REP is too high for Curie, more's the shame):

Example Art and Music Curation Queries


Both of these queries use the same basic structure. They include posts from the specified tag(s) and containing the specified keywords. I have included lines that will allow filtering by author rep, post age and pending post payout, but have set these relatively wide open by default to return more posts. Including these lines allows you to easily customize these parameters. I will provide both queries below and go over how to customize them.

Example Art Curation Query

SELECT
    'https://steemit.com'+ url, title, body, json_metadata
FROM
    Comments (NOLOCK)
WHERE
(
    json_metadata LIKE('%"art"%') or
    json_metadata LIKE('%"originalart"%') or
    json_metadata LIKE('%"drawing"%') or
    json_metadata LIKE('%"painting"%') or
    json_metadata LIKE('%"sculpture"%')
    ) and
    NOT author IN('vertical', 'trafficmonitor', 'markboss') and
    FREETEXT(Body, 'process method') and
    dirty = 'False' and
   parent_author = '' and
   datediff(minute, created, GETDATE()) between 30 and 5*24*60 and
   CONVERT(int,(SELECT MAX(v) FROM (VALUES(log10(ABS(CONVERT(bigint,author_reputation)-1)) - 9),(0)) T(v)) * SIGN(author_reputation) * 9 +25) between 25 and 100 and
   pending_payout_value < 1000.0000
ORDER BY
    created desc

Customizing the Example Art Curation Query


By default this query will return posts aged between 30 minutes and 5 days that have any of the following tags: "art"; "originalart"; "drawing"; "painting"; "sculpture"; AND which also contain any form of either of two keywords: "process" and "method". The goal of including these keywords is to return posts that include information about the artist's process; this will of course only work if the post author used either of the words "process" or "method". It works decently well in my experience, but definitely misses some art process posts that do not use these words. Flagged posts and posts by authors with <25 REP are excluded by the query by default. I have also excluded posts by three authors that regularly publish compilation posts of recent/top performing posts in various categories, which include art categories.

You can customize many parts of this query as follows:

  • Add or remove specified tags:
WHERE
(
    json_metadata LIKE('%"art"%') or
    json_metadata LIKE('%"originalart"%') or
    json_metadata LIKE('%"drawing"%') or
    json_metadata LIKE('%"painting"%') or
    json_metadata LIKE('%"sculpture"%')
    )

The above is the section of the query that specified which tags to include. To remove a tag from this list, simply remove the entire line. The very last tag should not have an "or" at the end, so if you remove the "sculpture" tag line, you will need to delete the "or" from the new last tag. To add more tags, just insert the desired tag in the following line of code where it says "tagname" and paste the line in at the top of the list of json_metadata LIKE statements:

json_metadata LIKE('%"tagname"%') or

A few notes here:

  • If you want to make a more exclusive query and only return posts that use ALL of the tags you specify, change the "or" after each json_metadata LIKE line to an "and". As written above with the "or" operators a post will be included if it contains any of the listed tags; if you change the "or"s to "and"s, a post will only be returned if it contains all of the listed tags.

  • In actual practice it might not make a lot of sense to include the general "art" tag along with the more specific tags like "drawing", "painting", etc. Presumably almost everyone who uses a more specific art tag will also use the general art tag, but this might not be the case so I included "art" anyway.

  • Add or remove keywords:

FREETEXT(Body, 'process method') and

The "Body" column is full text search enabled in the SteemSQL database, so we can use on of SQLs nifty FULLTEXT operators, the very handy FREETEXT operator. FREETEXT looks for any variation of any word contained inside the single quotes inside the parentheses (in this case, looking for variations of the words "process" and/or "method"). If you do not want to include any keywords in your search, simply delete this entire line. If you want to change or add keywords, just change them accordingly inside the single quotes. Each keyword should just be separated from other keywords by a space. This is an incredibly useful and easy way to narrow down the overwhelmingly large number of results that will return from a query that includes all posts with the tag "art".

  • Including posts that have been flagged
dirty = 'False' and

If you want to include posts that have been flagged, just delete the entire line above from the example query.

  • Changing the post age filters:
 datediff(minute, created, GETDATE()) between 30 and 5*24*60 and

By default, minimum post age is set to 30 minutes to maximize your curation return. Change the "30" to a "0" if you want to return posts that have been newly created. By default, maximum post age is set to 5 days. Because the unit we are using here is minutes (to allow 30 minute minimum post age), 5 days would have been... 7200 minutes. 7200 minutes is not a particularly intuitive way to express 5 days, so I expressed the max age as the formula "52460". There are 60 minutes in an hour and 24 hours in a day; to change the number of days the max post age is set at, all you have to do is change the first digit to reflect the desired number of days (e.g. 12460 for max age of 1 day, 22460 for max age of 2 days, etc.).

  • Changing the author REP filter:
CONVERT(int,(SELECT MAX(v) FROM (VALUES(log10(ABS(CONVERT(bigint,author_reputation)-1)) - 9),(0)) T(v)) * SIGN(author_reputation) * 9 +25) between 25 and 100 and

By default this query will return all posts by authors with 25+ REP (more accurately, with REP between 25 and 100, but there are no accounts on Steemit with 100 REP nor are there likely to be any time soon). It is pretty intuitive to change these settings - just adjust the numbers "25" and "100" to form the new REP range. For example, if you only want to include posts by authors with REP between 25 and 40, change the "100" to a "40" to make "40" the new upper limit of the REP filter.

  • Changing the pending post payout filter:
pending_payout_value < 1000.0000

By default this will return all posts with pending payout value < $1000... so pretty much all posts that meet the other query criteria. Just change the number after the "<" sign, remembering to keep the format X.XXXX (four decimal places and include a leading zero if you are setting a value less than 1). For example, to only include posts with less than $2 pending post payout, change the "1000.0000" to "2.0000"; to include posts with less than $.50 pending post payout, change the "1000.0000" to "0.5000", etc.

Example Music Curation Query

SELECT
    'https://steemit.com'+ url, title, body, json_metadata
FROM
    Comments (NOLOCK)
WHERE
(
    json_metadata LIKE('%"music"%') or
    json_metadata LIKE('%"originalmusic"%') or
    json_metadata LIKE('%"openmic"%') or
    json_metadata LIKE('%"acoustic"%') or
    json_metadata LIKE('%"song"%') or
    json_metadata LIKE('%"electronicmusic"%') 
    ) and
    NOT author IN('vertical', 'trafficmonitor', 'markboss') and
    FREETEXT(Body, 'original') and
    dirty = 'False' and
   parent_author = '' and
   datediff(minute, created, GETDATE()) between 30 and 5*24*60 and
   CONVERT(int,(SELECT MAX(v) FROM (VALUES(log10(ABS(CONVERT(bigint,author_reputation)-1)) - 9),(0)) T(v)) * SIGN(author_reputation) * 9 +25) between 25 and 100 and
   pending_payout_value < 1000.0000
ORDER BY
    created desc

By default this query will return posts aged between 30 minutes and 5 days that have any of the following tags: "music"; "originalmusic"; "openmic"; "acoustic"; "song"; "electronicmusic"; AND which also contain any form of the keyword "original" in the post body. The goal of including this keyword is to return original music posts; this will of course only work if the post author used the word "original" in the post body. Flagged posts and posts by authors with <25 REP are excluded by the query by default. I have also excluded posts by three authors that regularly publish compilation posts of recent/top performing posts in various categories, which include music categories.

See above under the example art curation query for how to customize this query - it works in the same fashion as the art query.

Customizing Results Columns


I want to cover something that I skipped in part 1 of this tutorial series for the sake of brevity. It is extremely easy to add, remove, or rearrange the columns of the result set that is returned by a query. The first part of every SQL query is the SELECT statement. This is a comma separated list, with every list item corresponding to a column in the results.

Example Customizing Results Query:

SELECT
    'https://steemit.com'+ url, title, body
FROM
    Comments (NOLOCK)
WHERE
    author = 'carlgnash'
ORDER BY
    created desc

Running this query will return every post and comment that I have made since I joined Steemit, from most recent to oldest. The first column in the result set (determined by the comma separated list after SELECT) will be a post/comment URL, the second column will be the associated post title (if a post), and the third column will be the associated post/comment body. See screencap below for what this query returned at 3:23am Pacific Time, 10/20/2017 - you are only looking at comments here as it has been a few days since my last proper post so you will not see any results in the "title" column below:

For our example we are querying the "Comments" table in the SteemSQL database (specified after the "FROM" statement in the sample query above) - I have circled the "Comments" table in pink in the screencap above. In the LINQPad window I have expanded the "Comments" table to reveal every column in this table. I have highlighted the columns yellow in the screencap above. You can add any column from the "Comments" table to the SELECT statement, separating each column with a comma, to add that column to the results set. So let's add "created", "depth" and "children" to our query (these are columns highlighted in yellow in the screencap above); I have inserted them into the beginning of the SELECT statement in the example query below, so that we can see their output in the screencap (inserting them into the beginning of the SELECT statement means they will be first in order of the results columns).

Example Customizing Results Query #2

SELECT
    created, depth, children, 'https://steemit.com'+ url, title, body
FROM
    Comments (NOLOCK)
WHERE
    author = 'carlgnash'
ORDER BY
    created desc

Here is a screencap of what the 2nd example query returned for me:

You can see that we have new columns for "created", "depth" and "children". "Created" is pretty self explanatory, that is when the post/comment was first authored. "Depth" tells you if this is a blog post (depth 0), a reply to a blog post (depth 1), a reply to a reply (depth 2) a reply to a reply to a reply (depth 3), etc. "Children" tells you how many replies this post/comment has spawned.

I wish I could tell you that there was some amazing reference that tells you exactly what each of the columns in the SteemSQL database is - but if such a thing exists, I have not found it. I have figured stuff out by trial and error, but a very helpful hint is to just include a column in the SELECT statement if you are not sure what it is; looking at the output column often gives you a clue as to what it is. For instance, "json-metadata" sounds very mysterious and I had no clue what it was. Turns out, if you include "json_metadata" you are going to return all the tags of a post along with some other stuff - very useful! Let's add "json_metadata" to the beginning of the SELECT statement in our example query and run it:

Example Customizing Results Query #3

SELECT
    json_metadata, created, depth, children, 'https://steemit.com'+ url, title, body
FROM
    Comments (NOLOCK)
WHERE
    author = 'carlgnash'
ORDER BY
    created desc

Here is a screencap of what the above query returned for me:

Since the result set is all comments, these results all have just have one tag in the json_metadata (comments have the same tag as the first tag/category of the parent post). If there is a link in the post/comment, it is stored in the json_metadata as well (circled in pink in the sreencap above).

If you want to play around with making your own queries and move beyond what I have outlined so far in my tutorials, a great place to start would be exploring the SteemSQL database by adding columns to your SELECT statement and running queries - you can check out other tables besides the Comment table by changing the FROM statement in your query to any table that you can see in the LINQPad connection window under the Steem SQL connection.

If you get stuck feel free to DM me (same username on steemit.chat; @gnashster on Discord) and I will do my best to help.

Much love - Carl




"Upvote Follow Resteem" lettering by @dillemma | color by @carlgnash

H2
H3
H4
3 columns
2 columns
1 column
20 Comments