Curate like a Boss - A Beginner's Guide to Querying SteemSQL (by a complete SQL beginner)

curate like a boss2.jpg
Tutorial and example SQL queries for: @curie curation; returning posts from multiple tags/categories; and following "favorite authors".

To the Curie Curators


The @curie guidelines are really the impetus behind this post. When I first began looking into what it would take to become a @curie curator I quickly realized that it would take a lot of work and time! With no efficient way to filter out new posts by the @curie guidelines, finding a quality post that fits the bill is a lot like the proverbial search for a needle in a haystack. So I started teaching myself enough SQL to build a query that would do the job. I am still working on a few things but I think what I have provided here is a decent start, and as far as I know it is the closest thing out there to a Curie customized advanced search. I am sharing this with you in the Steemit spirit of giving. At the end of this post I talk about what I am still trying to add to this query - if you have a better grasp of SQL than me (and if you have any experience with SQL at all, you probably do) and want to help me get this query over the remaining hurdles, please DM me (@carlgnash on steemit.chat / @gnashster on discord chat).

To the technophobes


Some parts of this post may look intimidating, particularly the SQL queries themselves. I want to break down the process of using one of the pre-built queries for you. You will have to navigate downloading a freeware software program and filling in a few fields to connect the software to the SteemSQL database, with step-by-step instructions below. Then you will just paste one of the SQL queries I provide below into the software and press "execute". You will receive a list of all posts that meet the search criteria in return. That is it. There is absolutely no special knowledge of coding or anything technically demanding about using a pre-built SQL query to filter new posts.

Some thanks!

First many thanks to @arcange for creating and hosting the publicly accessible SteemSQL database and for help with my queries, and @drakos and @justyy for helping with my queries as well. Another thanks goes out to @stoodkev for pointing out a potential solution to the problem of filtering out posts that do not contain any English characters in the post body - check out @stoodkev / @steem-plus' Steemit browser extension SteemPlus - among other things it lets you filter (by tag/resteem/reputation) and sort the posts (by time/payout/votes) from your own feed.

Last but not least I want to thank @misterakpan for recommending me as a @curie curator and spending a lot of time talking over the curation process and the custom @curie sql query with me. You are truly a gentleman and a scholar, and I am proud to call you my friend.

Background

I have found myself wondering many times why the search function within Steemit.com is so lacking. Why isn't there an advanced search option with filters that you can apply? There are a few 3rd party tools out there on various websites but none that I have found with truly customizable search features. A while back I ran into Arcange's SteemSQL database and told myself I needed to learn SQL to be able to query the database myself and get what I wanted.

Just this past week I started diving into SQL tutorials online and looking at SQL queries built by other users to crib something together. It turns out SQL is pretty intuitive, at least on the basic level, and I was able to have some success. As an absolute SQL neophyte myself, I figured who better to write a guide for SQL beginners?

What is the SteemSQL database?


A publicly accessible SQL database that contains the information from the Steem blockchain and structures the information in a manner that allows fast querying by any application able to connect to a SQL server. The main benefits are twofold; first, SQL database queries are executed much faster than queries to the blockchain itself; and second, no real programming knowledge is needed to interface with the SQL database.

How do I connect to the SteemSQL database?


If you have Microsoft Excel you can connect to the SteemSQL database with Excel, and @arcange has provided step by step instructions on how to do this here (just stop following his tutorial at the step where he starts talking about creating charts out of the results of the query ). Assuming you do not have Excel, or don't want to use it for this purpose, you can use any number of freeware applications. I personally use LINQPad 5 and the rest of this tutorial will explain how to download LINQPad, connect it to the SteemSQL database and execute SQL queries to return results.

STEP ONE - Download and install LINQPad

  • Download LINQPad 5 here and install (no you do not have to pay for the deluxe version, the freeware version will do everything we need so just select continue with the free installation).

    STEP TWO - Connect LINQPad to SteemSQL


    Once LINQPad has installed, open it and select the "Add connection" option at the top left:

Fill in the information in the "Add connection" pop-up window exactly as follows:

  • Select "LINQ to SQL" and click "Next" (this should be the default as shown below)
  • At the next screen you will enter the following information where I have highlighted yellow, then press "OK":
    Server: sql.steemsql.com
    User: steemit
    Password: steemit
    Database name: DBSteem

STEP THREE - Execute a Query!


Pick one of the sample queries provided below and copy it into your clipboard. Paste the query into the LINQPad window where I have highlighted the text "PASTE YOUR QUERY HERE!". You will also need to select the query language (SQL) and connection (the connection you set up at the last step) from the drop down menus where I highlighted yellow:

Now just press the play icon (execute) or press the F5 key:

STEP FOUR - Enjoy the fruits of your query


It will take a few seconds for the query to run, then the results will appear below the query box:

STEP FIVE - Save your query for future use!


Before you close out of LINQPad, save the query you executed so you can use it in the future without having to paste it in again. File / Save, Ctrl-S or right-click and save. Once you have saved a query, it will appear under "My queries" toward the bottom left of the LINQPad window (circled in pink below). Simply click on a saved query to load it.

Example Queries


For each of the example queries below I will first present the default query then I will discuss what the various parts of the SQL query do and how to customize it. There are three lines present in all the queries that I will go over here for all:

dirty = 'False' and

This line excludes all posts that have been flagged. If you do not want to exclude flagged posts, remove this line.

 parent_author = '' and

This line excludes comments. If you want to include comments as well as posts in your results, remove this line.

ORDER BY
    created desc

This line orders the results by post age with newest posts first. "desc" stands for "descending"; if you want to see oldest posts first change that to "asc" (for "ascending") like so:

ORDER BY
    created asc

You can choose other things from the queries to order the post by, using asc or desc to specify ascending or descending order for whatever you choose. For instance, to order posts by pending payout from least to most would be:

ORDER BY
     pending_payout_value asc

Example @curie Curation Query

SELECT
    'https://steemit.com'+ url, body,
    pending_payout_value, created,
    LEN(body) as CharacterCount,
    CONVERT(int,(SELECT MAX(v) FROM (VALUES(log10(ABS(CONVERT(bigint,author_reputation)-1)) - 9),(0)) T(v)) * SIGN(author_reputation) * 9 + 25) as rep
FROM 
    Comments (NOLOCK)
WHERE
   dirty = 'False' and
   parent_author = '' and
   datediff(minute, created, GETDATE()) between 150 and 24*60 and
   CONVERT(int,(SELECT MAX(v) FROM (VALUES(log10(ABS(CONVERT(bigint,author_reputation)-1)) - 9),(0)) T(v)) * SIGN(author_reputation) * 9 +25) between 27 and 52 and
   pending_payout_value < 1.0000 and
   LEN(body) > 4000
ORDER BY
    created desc

This query will return all new posts aged between 150 minutes and 24 hours, by authors with REP between 27 and 52, with pending post payout value less than $1, with no flags and with more than 4000 characters in the post body (including HTML). Things you can customize:

 datediff(minute, created, GETDATE()) between 150 and 24*60 and

This line specifies the post age between 150 minutes and 24 hours. If you want to include posts aged as young as 45 minutes you would change the "150" to a "45" - if you want to include all posts up to one day old you would change the "150" to a "0" (posts aged 0 to 24 hours).

 CONVERT(int,(SELECT MAX(v) FROM (VALUES(log10(ABS(CONVERT(bigint,author_reputation)-1)) - 9),(0)) T(v)) * SIGN(author_reputation) * 9 +25) between 27 and 52 and

This scary looking line specifies author REP between 27 and 52. Look for the numbers 27 and 52 at the end of the line - change these numbers to change the REP filter.

pending_payout_value < 1.0000 and

This line is pretty self-explanatory and specifies pending post payout value less than $1. Adjust the number after the < symbol to change this but make sure to keep the number in X.XXXX format - e.g. pending_payout_value <0.5000 would only return posts with less than $.50 pending post payout.

LEN(body) > 4000

This line specifies only posts with more than 4000 characters in the post body. Change the number after the > symbol to adjust this filter. Normally 4000 characters is about 500 words, but remember that the query will count the HTML characters that format the post body as well; in my experience having used this query for a few days I think 4000 is a good minimum. If you do not want to filter by post length at all just change the "4000" to a "0" to include all posts.

Example Query for Multiple Tags/Categories

SELECT
    'https://steemit.com'+ url, body,
    pending_payout_value, created,
    CONVERT(int,(SELECT MAX(v) FROM (VALUES(log10(ABS(CONVERT(bigint,author_reputation)-1)) - 9),(0)) T(v)) * SIGN(author_reputation) * 9 + 25) as rep
FROM 
    Comments (NOLOCK)
WHERE
   dirty = 'False' and
   parent_author = '' and
   category IN ('poetry', 'fiction', 'writing') and 
   datediff(hour, created, GETDATE()) between 0 and 7*24 and
   CONVERT(int,(SELECT MAX(v) FROM (VALUES(log10(ABS(CONVERT(bigint,author_reputation)-1)) - 9),(0)) T(v)) * SIGN(author_reputation) * 9 +25) between 25 and 100 and
   pending_payout_value < 9000.0000
ORDER BY
    created desc

This query as written will return all posts from the past week that have at least one of the following tags: #poetry; #fiction; #writing, from authors with 25+ rep. Insert your own tags/categories by editing this line:

category IN ('TAG1', 'TAG2', 'TAG3', 'TAG4ETC') and 

Make sure you have single quotes around each tag ('example') and include a comma after each tag except the last tag.

  datediff(hour, created, GETDATE()) between 0 and 7*24 and

This line specifies post age - this is in hours, so to see posts between 6 and 12 hours old you would adjust this to:

  datediff(hour, created, GETDATE()) between 6 and 12 and

You can change the time unit to days like so (the below would return all posts aged 0 to 30 days):

  datediff(day, created, GETDATE()) between 0 and 30 and


CONVERT(int,(SELECT MAX(v) FROM (VALUES(log10(ABS(CONVERT(bigint,author_reputation)-1)) - 9),(0)) T(v)) * SIGN(author_reputation) * 9 +25) between 25 and 100 and

This line controls the author REP parameter - as written this would return posts by authors with REP between 25 and 100, which effectively means REP 25+. If you want to specify a different range of author REP just adjust the numbers at the end of this line, e.g. to only see posts by authors with REP between 25 and 50:

CONVERT(int,(SELECT MAX(v) FROM (VALUES(log10(ABS(CONVERT(bigint,author_reputation)-1)) - 9),(0)) T(v)) * SIGN(author_reputation) * 9 +25) between 25 and 50 and


pending_payout_value < 9000.0000

This line specifies only posts with pending payout less than $9000 - so effectively this is including everything. I included this line so you could adjust it if you did want to filter for posts by pending payout. If you change this number make sure to keep it in the format X.XXXX with four decimal places. To only include posts with less than $5 payout:

pending_payout_value < 5.0000

To included posts with pending payout between $.50 and $1:

pending_payout_value between 0.5000 and 1.0000

Example Query to Follow Favorite Posters

SELECT
    'https://steemit.com'+ url, body,
    created
FROM 
    Comments (NOLOCK)
WHERE
   parent_author = '' and
   datediff(hour, created, GETDATE()) between 0 and 7*24 and
   author IN ('crimsonclad', 'vachemorte', 'buckydurddle', 'yusaymon', 'juliakponsford', 'stickchumpion', 'stitchybitch', 'jrhughes', 'clayboyn', 'aggroed', 'sircork', 'sammosk', 'spaingaroo', 'tremendospercy', 'gmuxx', 'misterakpan')
 
ORDER BY
    created desc

This query returns all posts created in the past week by some of my favorite Steemit posters - give it a run and enjoy some terrific posts! You can thank me later :) I wrote this query because I have always wished that Steemit had a way to "favorite" someone you were following so that you would always see their posts - I am adding my favorites to this query so I can run it and see all their recent posts. To my friends, don't be offended if you don't see yourself in the above query - it is most likely operator error, I am still adding folks to this list and I know I am still forgetting some people that I definitely want to be on there!

To create your own version of a favorite poster query, just edit this line to include your own favorite posters. Do NOT include the "@" symbol in front of the user name, make sure to enclose each name in single quotes (like so: 'username') and remember to put a comma after each name except for the last one:

author IN ('carlgnash', 'favoriteposter2', 'favoriteposter3ETC')


  datediff(hour, created, GETDATE()) between 0 and 7*24 and

This line specifies post age - this is in hours, so to see posts between 6 and 12 hours old you would adjust this to:

  datediff(hour, created, GETDATE()) between 6 and 12 and

You can change the time unit to days like so (the below would return all posts aged 0 to 30 days):

  datediff(day, created, GETDATE()) between 0 and 30 and

You can remove this line entirely if you want to see all posts by your favorite authors.

What I am still working on for the @curie query


There are two more big things I want to accomplish with the @curie curation query:

  • Filter by sum of past week's pending author payout (e.g. only include posts where the author has less than $25 pending post payout for all posts in the past week). This would both take care of the requirement that the author not have received a @curie vote in the past week, and would help exclude authors who are making too much money on their posts to qualify as "persistent without much success". I don't think this should be terribly hard but I am still figuring out how to nest the WHERE statements properly.
  • Filter out posts that do not contain any English characters in the post body (to satisfy the English language requirement). This one is kind of tricky as the HTML that formats the post body is included in the post body column in the SteemSQL database. @stoodkev pointed out that I could use regex to strip the HTML and other formatting characters out first, then exclude the post if no English characters remain in post body. @stoodkev gave me the regex I will need to accomplish this, but I still need to build it into the query and test it.

I am not kidding when I say that I literally just started teaching myself SQL within the past week. I learn quickly and have already grasped a lot compared to where I was at when I first poked my head in the SteemSQL help chat on steemit.chat... but I am still very much a SQL novice. If you are a SQL master and can help me take care of these last two items and add them into the @curie query I would certainly be grateful for any help!

Much love - Carl




"Upvote Follow Resteem" lettering by @dillemma | color by @carlgnash

H2
H3
H4
3 columns
2 columns
1 column
80 Comments